[01:37:20] 10netops, 10Operations: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4122151 (10ayounsi) [06:52:10] 10Traffic, 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122392 (10Vgutierrez) Fixed following @Volans recommendations: ``` vgutierrez@neodymium:~$ sudo cumin 'R:class%site = eqsin' 'ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volati... [06:52:31] 10Traffic, 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122393 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [07:12:49] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo... [07:13:11] yey.. now it's working :D [07:23:42] \o/ [07:37:19] mmh we forgot to remove the last varnishxcache leftover (/etc/nagios/nrpe.d/check_varnishxcache.cfg). Doing that now with cumin. [07:37:31] ouch [07:37:35] thx [07:37:36] 07:37:01 | lvs5003.eqsin.wmnet | Still waiting for reboot after 15.0 minutes [07:37:39] hmmm [07:39:19] suspicious [07:39:31] (2nd reboot) [07:39:38] let's see the console.. [07:40:07] 15 mins is within the usual time it takes, though [07:40:17] first reboot was 9 minutes [07:46:18] vgutierrez: how does it look like in console? [07:48:43] initramfs prompt [07:48:59] (I was figuring out how to get on the console) [07:50:47] vgutierrez: I've figured that out so many times that I wrote this https://wikitech.wikimedia.org/wiki/User:Ema/Remote_Access [07:52:14] ALERT! /dev/disk/by-uuid/bdab6aed-fc96-450b-b073-0b6515aa5168 does not exist [07:52:25] root=UUID=bdab6aed-fc96-450b-b073-0b6515aa5168 [07:54:09] mmh [07:54:10] vgutierrez: https://phabricator.wikimedia.org/T149845#3906167 [07:54:39] that seems related [07:56:23] yup [07:56:29] mdadm --assemble --scan fixes the issue [07:58:17] the ticket explicitly mentions jessie though, it might be worth mentioning that the issue isn't gone with stretch [07:59:39] vgutierrez: also see ebe9658 [08:01:07] Jaime also ran into it on stretch: https://phabricator.wikimedia.org/T149845#3922449 [08:07:30] ema: hmmm well.. somehow lvs5003 got reimaged as jessie :/ [08:08:47] meanwhile, eqsin is depooled due to T191940 [08:08:47] T191940: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940 [08:09:13] I'd say we should keep it so while lvs5003 is down [08:10:42] so... stretch is the default distribution.. and I ran puppet on install1002 after merging https://gerrit.wikimedia.org/r/#/c/425475/1/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [08:11:41] but I'm obviously missing something [08:14:52] ema: eqsin could be using install2002 instead of install1002? [08:17:06] vgutierrez: stupid qs - super sure that the pxe boot started after install1002 was updated? [08:17:30] elukey: I ran puppet manually on install1002 and launched the auto-reimage script after that [08:17:40] very weird [08:17:40] so.. 
90% sure [08:17:42] 1002 is the install server for all, 2002 is just a failover [08:17:51] ack [08:19:10] vgutierrez: mmh, /etc/dhcp/linux-host-entries.ttyS1-115200 on install1002 still mentions jessie for lvs5002 [08:19:25] s/5002/5003/ [08:19:40] vgutierrez@install1002:~$ grep -A 4 lvs5003 /etc/dhcp/linux-host-entries.ttyS1-115200 [08:19:43] host lvs5003 { [08:19:46] hardware ethernet F4:E9:D4:D0:77:40; [08:19:46] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [08:19:48] fixed-address lvs5003.eqsin.wmnet; [08:19:51] } [08:19:53] wut= [08:19:56] ? [08:20:18] yeah, ignore me, I was looking at the entry for cp5003 instead [08:20:33] :* [08:20:35] coffee++ [08:20:58] so I guess I'll have to reimage it again while looking at the console [08:21:02] so timings in syslog for puppet run + dhcp activity for lvs5003 seems to be ok [08:22:48] vgutierrez: I'd try another time as you suggested, maybe you can just force a pxe boot and see how it goes? [08:23:50] sure [08:25:26] nice timing.. the auto-image script just rebooted lvs5003 [08:27:35] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122514 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ``` Of which those **FAILED**: ``` ['lvs5003.eqsin.wmnet'] ``` [08:27:42] elukey: I'm searching for the dhcp activity logs without luck. Where are they? [08:28:21] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122515 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo... [08:30:44] ema: I grepped "10.132.0.13" /var/log/syslog [08:30:49] on install1002 [08:33:06] it's loading the installer again. so let's see [08:33:52] elukey: thanks :) [08:34:47] so, there's this for eqsin on install1002's /etc/dhcp/dhcpd.conf: [08:34:49] next-server 103.102.166.7; # bast5001 (tftp server) [08:35:18] yup.. every dc uses their own tftp server apparently [08:35:28] can it be that puppet hadn't run on bast5001 yet at the point of the reimage, and that's why lvs5003 was booted into the jessie installer? [08:35:57] hmmm the default tftp path is set to stretch on the dhcp server [08:41:48] yeah [08:42:16] atftpd.service's logs are not particularly useful BTW [08:42:24] Apr 11 08:30:09 bast5001 atftpd[759]: Serving lpxelinux.0 to 10.132.0.13:2071 [08:42:43] well at least right now it's using debian 9 installer [08:42:57] maybe I didn't wait enough between the puppet run and the PXE reboot [08:43:00] :/ [08:44:16] possibly. It would be nice if atftp could log things like "Serving jessie-installer/pxelinux.0 to [...]" :) [08:44:51] yup [08:46:41] rebooting on the brand new OS [08:47:05] let's see what happens now with the mdadm RAID [08:48:16] nice boot :D [08:58:27] well... at least I was able to predict the NIC interface name successfully, enp5s0f0 [08:58:39] nice [08:59:03] so did it boot without mdadm issues this time? [08:59:12] yup [08:59:25] like a charm and with the expected debian version [09:01:10] \o/ [09:02:14] I'm gonna bounce pybal on lvs1003 to enable UDP monitoring there as well, lvs1006 had no issues at all throughout the EU night [09:02:24] nice [09:08:19] wmt-auto-reimage messes with downtimes even with --no-downtime option [09:08:25] :/ [09:08:52] vgutierrez: what do you mean? 
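(Editor's note: the T149845 workaround mentioned earlier this morning boils down to assembling the RAID array by hand from the initramfs shell; a minimal sketch, assuming the array members themselves are healthy and only auto-assembly failed:)

```
# at the (initramfs) prompt shown after the "/dev/disk/by-uuid/... does not exist" ALERT
mdadm --assemble --scan   # scan for md superblocks and assemble the arrays
exit                      # leave the rescue shell; boot continues once the root device exists
```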
[09:09:16] so before I began reimaging lvs5003 I manually set a 4 hours downtime [09:09:21] ok [09:09:28] and ran the wmf-auto-reimage script with the --no-downtime option [09:09:50] ack [09:09:57] after the first puppet run, the script set another downtime on icinga [09:09:59] 08:24:26 | lvs5003.eqsin.wmnet | Downtimed on Icinga [09:10:02] * volans making the suspension grow [09:10:17] that's normal [09:10:37] you'll ask... why on earth? [09:10:50] * ema grabs the popcorn [09:10:58] nah.. I'm jinxed with all the reimage stuff of lvs5003 [09:11:07] I expect an earthquake on eqsin in 5 minutes [09:11:28] the answer is simple, to reimage we have to do a puppet node clean/deactivate that will delete all the exported resources and at the next run of puppet on the icinga server will delete all the checks [09:11:58] so when the new exported resources will be added, later on those are not covered by the old disappeared downtime [09:12:21] so the reimage script forces a run of puppet on the icinga server and immediately set a downtime for the host [09:12:47] ha! [09:12:52] *but* it does it sequentially, so only *after* the first puppet run, often puppet runs independently on the icinga server in the meanwhile [09:12:58] and loads some checks already exported [09:13:00] but not all [09:13:23] so there is no real solution, unless running puppet in parallel on the icinga server like every 2 minutes... [09:13:34] that doesn't seem a sound solution to me [09:13:40] understood [09:13:41] vgutierrez: hey I've ssh'ed onto lvs5003 \o/ [09:13:50] ema: now I can blame you on everything [09:13:54] O:) [09:14:14] it's going to be rebooted soon [09:14:18] puppet first run just finished [09:14:58] cool [09:14:58] vgutierrez: what the --no-downtime does is to not set the downtime *before* the reimage, historically due to T145192 [09:14:59] T145192: icinga-downtime script waiting forever if host already in downtime - https://phabricator.wikimedia.org/T145192 [09:15:16] volans: ack [09:15:30] BTW, now that we're discussing the wmf-auto-reimage script [09:15:59] * volans hides [09:16:30] if ipmi fails on the first run, it tries to log into phabricator before the phab client has been properly initialized [09:16:52] wut? do you have a stack trace? [09:17:03] yup [09:18:01] get_phabricator_client is the first thing [09:18:21] feel free to open a task with the Operations-Software-Development tag [09:19:14] vgutierrez: reimage script or reimage-host script? I bet the latter [09:19:43] reimage-host [09:19:49] yeah, found the issue [09:19:55] fix in few minutes [09:19:57] thanks for the report [09:20:24] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122595 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ``` and were **ALL** successful. [09:20:35] I bet the stack is line 232 phab_client is not defined [09:22:16] sigh.. it wasn't logged [09:22:17] sorry [09:24:59] vgutierrez: https://gerrit.wikimedia.org/r/#/c/425495/ [09:25:00] ema: so far lvs5003 looks good.. pybal up & running, lo interface properly tagged, ipvsadm looks sane... [09:25:04] I think should be it [09:25:31] see line 200-204 of pre-existing code [09:25:45] ema: let's check manually the interface tweaks... [09:27:50] something weird happening with augeas... 
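(Editor's note: the double downtime follows from how the reimage flow interacts with exported resources. A rough sketch of the sequence volans describes above — command names assumed from his description, not the script's exact internals:)

```
# 1) the reimage wipes the node from puppetdb, which also drops its exported nagios checks
puppet node clean lvs5003.eqsin.wmnet
puppet node deactivate lvs5003.eqsin.wmnet
# 2) the next puppet run on the icinga server removes those checks, and any downtime set on
#    them (including the manual 4-hour one) disappears with them
# 3) once the host's first puppet run re-exports the checks, the script runs puppet on the
#    icinga server again and immediately re-downtimes the host -- the 08:24:26 line above
```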
[09:28:49] https://phabricator.wikimedia.org/P6980 [09:29:10] interface-rps and txqlen lines are duplicated [09:37:41] vgutierrez: not sure if related, but augeas-tools is not installed on lvs5003 [09:38:42] oh well it's not installed on lvs5001 either :) [09:47:46] vgutierrez: I've installed augeas-tools on both 5003 and 5001 to compare their output, nothing extraordinary really [09:49:29] sigh.. this is going to be funny [09:49:38] I just removed manually the dupped lines and re-ran puppet [09:49:54] so this time /etc/network/interfaces was updated as expected [09:49:56] no dupped lines [09:50:07] running puppet again [09:51:47] still no duplicates [09:52:24] yet another corner case issue [09:53:16] those lines are being injected by interface::up_command [09:53:33] changes => "set up[last()+1] '${command}'", [09:53:33] onlyif => "match up[. = '${command}'] size == 0"; [09:53:41] that onlyif should avoid what we are seeing, right? [09:54:30] ./modules/interface/manifests/up_command.pp [10:04:01] so, when puppet ran at 09:12:25 there already was a duplicate entry for the echo 10000 command (see /var/log/puppet.log) [10:07:51] hmmm [10:08:01] that log doesn't say when that line was added [10:13:14] anyways, other than the duplicate lines there seems to be another interesting difference in /etc/network/interfaces compared to lvs500[12] [10:13:19] on lvs5003: dns-nameservers 103.102.166.254 208.80.153.254 [10:13:33] on lvs5002: dns-nameservers 208.80.154.254 208.80.153.254 [10:14:06] (lvs5001 also has dns-nameservers 208.80.154.254 208.80.153.254) [10:15:29] /etc/resolv.conf is the same on all lvs-eqsin hosts though [10:17:01] hmm [10:17:34] I cannot find any reference to dns-nameservers on puppet [10:18:01] * volans shyly asks himself if puppetboard might be useful for this debugging [10:18:22] vgutierrez: modules/install_server/files/autoinstall/subnets/{public1,private1}-eqsin.cfg [10:18:29] (netcfg/get_nameservers) [10:19:41] so it looks like lvs5002 has deprecated config [10:19:43] ok so the difference is because of 71bde69 [10:20:03] lvs500[12] were most likely installed before that commit? [10:20:17] looks like that yes [10:21:17] note that 103.102.166.254 can't be used by lvs5003 for name resolution (that's the recdns service IP lvs5003 has on the loopback interface) [10:22:59] anyways, resolvconf isn't installed so probably nothing much to see here [10:24:18] yup.. /etc/resolv.conf has the proper ones as you pointed out [10:26:54] I was thinking... [10:27:12] what's the reason behind having different pybal packages for jessie/stretch? [10:27:18] volans: it's funny but I logged right now for the first time since I work here on puppetboard [10:27:26] it's just python code, so the same .deb should work on both [10:27:28] ema: my dumbness regarding apt I guess [10:27:46] vgutierrez: it's in prod since ~2 weeks ;) [10:27:50] so not that strange [10:27:58] I packaged 1.15.3 for stretch ema [10:28:36] and you're probably right, we don't need to differentiate them [10:28:45] vgutierrez: ok, I think there's no need to do that. We can just copy the .deb into the stretch version of the archive [10:29:05] volans: puppetboard? what is that :) [10:29:20] I see that you follow very much our quarterly goals :-P [10:29:39] puppetdb front-end [10:30:27] but that's awesome [10:30:54] https://puppetboard.wikimedia.org/node/lvs5003.eqsin.wmnet [10:30:59] thanks volans :) [10:31:16] you can see the changes in the recent puppet runs [10:33:27] ema: so.. what we think about lvs5003? 
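(Editor's note: with augeas-tools installed it is easy to look at the stanza the same way interface::up_command's onlyif does; a small sketch, interface name assumed to match the paste above:)

```
# list the "up" entries augeas sees for the primary interface; duplicates appear as extra nodes
sudo augtool match "/files/etc/network/interfaces/*[. = 'enp5s0f0']/up"
# the onlyif only suppresses an *exact* string match, so any earlier mismatch (e.g. a changed
# txqueuelen value) results in another "up ..." line being appended
```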
[10:33:41] augeas messing up with us aside :) [10:34:58] vgutierrez: I think that it looks fine! [10:35:57] vgutierrez: perhaps we should re-enable bgp, stop pybal on lvs5001, and test a few requests now that eqsin is depooled? [10:36:16] indeed [10:36:20] let's stop puppet? [10:36:27] or just commit the hiera change? [10:36:52] I'd say hiera [10:36:53] BTW, cool stuff: https://puppetboard.wikimedia.org/catalogs/compare/lvs5003.eqsin.wmnet...lvs5002.eqsin.wmnet? [10:37:16] you can filter by interface::rps for instance [10:37:30] ema: ack [10:39:24] if it only was properly sorted :( [10:39:32] volans: indeed [10:40:04] and highlight the differecens ;) [10:40:33] ema: there you go https://gerrit.wikimedia.org/r/#/c/425508/ [10:40:58] (CR ping-pong) [10:41:30] volans: foreman is still a thing? [10:41:40] (regarding puppetdb frontends) [10:42:56] vgutierrez: maybe add Bug: T177961 ? [10:42:57] T177961: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961 [10:43:00] mmmh what do you mean? IIRC foreman is used in the opposite side, you manage stuff on foreman and it has a plugin to do stuff on puppetdb, but I might have misunderstood your question [10:43:05] ema: duh :) [10:43:46] T191897 [10:43:46] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [10:43:54] 177961 is the global one [10:45:21] Line 4: Unexpected blank line [10:45:25] thanks jenkins! [10:45:31] :) [10:48:24] I'm still fascinated by eqsin<->codfw rtt being lower than eqsin<->any_other_dc [10:50:36] https://puppetboard.wikimedia.org/report/lvs5003.eqsin.wmnet/be2222d6405460fa66b966d231d3fe8ddd7283df [10:50:40] bgp enabled :) [10:50:52] now we can even link puppet changes easily [10:51:08] for the last ~2 days IIRC the config [10:51:31] vgutierrez: restarting pybal on lvs5003 [10:52:09] ema: you just scared me [10:52:23] ema: I ran a systemctl status pybal while you were restarting it [10:52:29] uh, sorry! [10:52:31] so I got all the typical errors [10:52:44] nah nah, don't worry [10:53:16] vgutierrez: shall we stop pybal on lvs5001 and test a few requests? [10:53:46] bgp looks good? [10:54:22] I think so: BGP session established for ASN 64600 [...] [10:54:50] and other pleasant-looking bgp messages [10:56:39] 103.102.166.254/32 *[BGP/170] 11:50:24, MED 0, localpref 100 [10:56:39] AS path: 64600 I, validation-state: unverified [10:56:39] > to 10.132.0.12 via ae1.520 [10:56:39] [BGP/170] 00:04:39, MED 100, localpref 100 [10:56:39] AS path: 64600 I, validation-state: unverified [10:56:41] > to 10.132.0.13 via ae1.520 [10:56:45] ema: go ahead [10:56:58] looks good from the router side as well [10:57:48] curl --resolve en.wikipedia.org:443:103.102.166.224 -s -I https://en.wikipedia.org/wiki/Main_Page [10:57:51] this works fine ^ [10:58:06] and I'm seeing active conns on 5003's ipvsadm -Ln [10:58:15] <3 [10:59:09] https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=19&fullscreen&orgId=1&from=now-1h&to=now [11:00:38] nice [11:05:44] OpenSSL 1.1 as used for TLS termination on cp* is already upgraded by Valentín, I had a look at the services using OpenSSL1.0.2; most of the services is restarted by wmf-auto-update, but a few remain: varnishstatsd, varnishreqstats, varnishospital, varnishslowlog and varnishkafka. their current process life time correlates with the uptime [11:06:19] should be just ignore them for openssl updates, are they tricky to restart in general (at least for varnishkafka I think that's the case)? 
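(Editor's note: a quick way to enumerate the remaining OpenSSL 1.0.2 users moritzm mentions is to look for processes still mapping the deleted libraries; a rough sketch — checkrestart comes from debian-goodies and is assumed to be available:)

```
# daemons that still need a restart after the library upgrade
sudo checkrestart
# or, more manually: processes holding deleted libssl/libcrypto mappings
sudo lsof -n | grep -E 'DEL.*(libssl|libcrypto)'
```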
[11:07:21] moritzm: all those services can be restarted without notice, perhaps with the exception of varnishkafka? elukey? [11:08:09] vgutierrez: anything specific you want to test, or can I restart pybal on 5001? [11:08:49] ema: everything looks good, go ahead :D [11:09:00] ema: ah, good. I'll run some tests and prepare patches for adding them to wmf-auto-update, then [11:09:26] yey.. varnishkafka-friends requires at least pinging elukey / analytics [11:10:03] lvs* only has openssl rdeps which are auto-restarted (NRPE and friends, upgrading OpenSSL there now) [11:10:04] that + rebooting them once a time [11:11:03] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122849 (10Vgutierrez) [11:12:05] ema: so.. if we ignore the reimaging issues, it wasn't that hard [11:12:36] the hardest part was probably guessing the interface name :D [11:13:12] for eth0 is solved here: https://phabricator.wikimedia.org/P6940 [11:13:39] magic! [11:13:50] taking into account that ONBOARD > PATH [11:14:04] so in lvs2001 it will be eno1 instead of enp3s0f0 [11:15:01] perfect [11:15:22] so, here is a picture of the successful failover test: https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=20&fullscreen&orgId=1&from=1523444036690&to=1523445285720 [11:16:14] lunch time, bbl [11:16:15] so.. next steps.. continue with other secondaries like lvs4007? [11:17:23] (yes I think 4007 is a good next candidate) [11:17:50] I'll prepare the CR after lunch.. I'm hungry as hell right now [11:41:04] catching up on a few things above: [11:41:16] ema> I'm still fascinated by eqsin<->codfw rtt being lower than eqsin<->any_other_dc [11:41:35] because the main transport link for eqsin is eqsin<->codfw (not ulsfo or other), so it all goes there first. [11:44:46] re: lvs5 nameservers config: yes, they were installed before the nameserver stuff was re-configured. I'm not sure to what degree dns-nameservers in /e/n/i even matters if we have explicit resolv.conf puppetized, but it can't hurt to manually patch it up to match in any case. [11:46:38] sounds like we might have duplicate txqlen in /e/n/i as well? may have been the result of the puppetization change for txqlen, ugh [11:46:56] it doesn't technically hurt anything though, I don't think [11:47:22] augeaus is what it is heh [12:28:38] bblack: so, the duplicate entries were interface-rps and tx_queue_len https://phabricator.wikimedia.org/P6980 [12:29:12] what's weird is that after removing the duplicates by hand, all subsequent puppet runs did not add anything [12:31:55] ok [12:32:14] and yeah, a lot of our augeas stuff for /e/n/i is more or less one-shot, there's not a good way to "key" it [12:32:43] we've historically had lots of problems with runtime mods of related params. e.g. if you changed the txqueuelen later you might get double entries (old and new val) [12:34:05] maybe we could use /etc/network/interfaces.d/ instead of changing /e/n/i with augeas? [12:34:09] up_command is the underlying culprit I think [12:34:24] maybe, but that's kind of a broad-scope thing to be looking at here [12:34:28] bblack: that makes sense with the current up_command [12:34:35] augeas { "${interface}_${title}": [12:34:35] context => "/files/etc/network/interfaces/*[. = '${interface}']", [12:34:38] changes => "set up[last()+1] '${command}'", [12:34:40] onlyif => "match up[. 
= '${command}'] size == 0"; [12:34:49] the old and the new value I mean [12:35:05] basically if it sees the exact same command already there, it will do no-op. otherwise it appends the command. but changing any parameter will lead to duplication [12:35:11] doesn't explain how we got an exact duplicate, though [12:44:16] CR for lvs4007: https://gerrit.wikimedia.org/r/#/c/425520/ (and I already tested ipmitool from neodymium ha!) [12:44:18] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123008 (10ema) p:05Triage>03High [12:44:42] moritzm,ema,vgutierrez - yes I'd prefer to keep varnishkafka as exception, and restart it only when needed (or as few as possible) [12:45:00] sure [12:53:20] elukey: when we do need to, what's the process? is it just noted somewhere so that statistical anomalies that others notice have a reason? [12:54:24] in general, the dependency model of varnish<->vk (and similar logging daemons) has always been not-ideal... I think in the ideal world we have software solutions to this.... [12:54:39] probably the ideal model involves: [12:55:17] bblack: a simple restart is fine, but recently I have been thinking if there is some chance that we loose a bit of data when doing so. varnishkafka starts reading from the tail of the shm log, so if varnish is not depooled completely and vk restarts, it will probably skip some records [12:55:25] (a) That all the daemons that log traffic/stats from varnish operate fine when varnishd is down/missing (and reconnect to it quickly once it's available)... as opposed to e.g. crashing or burning huge cpu cycles when it's down or failing to reconnect promptly, etc... [12:56:14] (b) That we make the varnishd services depend on (in the systemd sense) the vk-like daemons, whereas today it might be the other way around. in other words, the logger must be ready before traffic starts flowing, and if you want to take down the logger that also means taking down the varnishd first. [12:57:03] (c) Making sure pooling works right with that automatically would be ideal as well, so that a naive "service varnishkafka restart" not only implies first stopping varnishd, but also depooling and waiting period before varnishd dies. [12:59:02] some bits of that puzzle have been worked on, and some bits are already in the right states or close to it, but as a whole there are still some gaps and work to do and sanity-checking before we reach such an ideal state [12:59:09] there's a related ticket or two somewhere-or-other I think [13:01:13] https://phabricator.wikimedia.org/T128374 [13:01:17] yep :) [13:02:03] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123036 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... [13:02:25] vk is able to start and "wait" for varnish to open a shm log, but indeed a complete drain of the varnish traffic and its shm log would be ideal when vk restarts [13:02:58] I haven't tried to quantify the amount of "loss" that happens when vk restarts [13:03:22] well, varnish shmlog rolls over fairly quickly [13:04:09] if a vk restart is super-quick, like ~100ms edge-to-edge, maybe it's not much loss. 
if in practice it takes several seconds or more, you might lose something close to the same amount of traffic, as shmlog isn't a big buffer [13:04:47] the restart is very fast afaics [13:04:54] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123039 (10Imarlier) Change in observed performance due to depooling of Singapore: Synthetic tests (from AWS Mumbai): https://grafana.wikimedia.org/dash... [13:05:27] but it would be interesting to know exactly, for example, before shutting down how many records are left to read in the shm [13:06:32] and I only have the fuzziest of ideas what our average shmlog rollover-time is before the oldest record is wiped to make room for more. [13:06:39] it would be nice to know that, sometimes :) [13:07:21] lovely... running puppet in install1002 is not enough clearly [13:08:01] lvs4007 rebooted with jessie installer :/ [13:09:07] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ``` [13:09:24] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123048 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... [13:09:27] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123049 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ``` [13:09:59] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123051 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... [13:10:32] vgutierrez: I think you ahve to do install2002 as well [13:11:05] yey, I ran puppet on install2002 and rebooted lvs4007 again [13:11:11] let's see what happens now [13:13:58] right.. [13:13:58] Apr 11 13:11:52 install2002 dhcpd: DHCPREQUEST for 10.128.0.17 (208.80.153.53) from f4:e9:d4:ba:ee:f0 via 10.128.0.2 [13:14:01] Apr 11 13:11:52 install2002 dhcpd: DHCPACK on 10.128.0.17 to f4:e9:d4:ba:ee:f0 via 10.128.0.2 [13:14:04] that mac address is lvs4007 [13:14:24] and now is running stretch installer [13:24:09] nice, already booted on stretch, running puppet [13:24:16] traffic folks, I've a question for you, is this a good time? [13:24:41] * vgutierrez hides [13:25:34] volans: depends on the question! :) [13:25:39] lol [13:26:27] I'm starting to layout some puppettization for the soon-to-be released debmonitor new service and I have some related questions [13:28:04] 1) the website server part could be a generic misc-pass service, if configured in active/passive. Although apart from the login/logout/admin paths all the rest of the website is read-only (as of now) and could also be active/active, but only for some paths [13:28:32] I'm not sure if it's worth though, I guess it would require some VCL to do it [13:28:48] probably active/passive is ok, thoughts? 
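(Editor's note: going back to the shmlog-rollover question above, one way to get a rough number is to look at the oldest transaction still present in the buffer; a sketch, with parameter/option names as in varnish 4/5 — -n may be needed to pick the frontend instance on cp hosts:)

```
sudo varnishadm param.show vsl_space               # size of the shared-memory log buffer
sudo varnishlog -d -g raw -i Timestamp | head -n1  # oldest Timestamp record still readable
# comparing that epoch against "now" gives an idea of how much history a restarting
# varnishkafka can still catch up on before records are overwritten
```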
[13:29:06] [in answer to that: we can split the a/a-vs-a/p and/or the pass-vs-cacheable on subpaths, but not in any declarative/simple way on other factors like cookies or other request attributes, just paths or hostnames] [13:29:46] [and if it's expected to be a low-volume traffic source anyways, there's nothing wrong with just going pass-only and a/p for all just to be simpler] [13:29:57] agree [13:31:36] 2) all the hosts will need to reach this service in HTTPS to do a POST to send the package list and basically update it, and I'd like to do the HTTPS with the client cert so that debmonitor can ensure that each host is allowed to updated only its own data [13:32:52] this can be done easily in nginx setting some headers on the proxy pass config, so I guess it will need some internal HTTPS endpoint, given that it cannot (and would not be ideal) to pass through varnish [13:33:06] do we have a 'standard' way of doing something like this? [13:35:12] I don't know if we have other cases that check client certs via some nginx/tlsproxy type of thing [13:35:18] to cargo-cult from [13:35:56] that should be easy, I think I already have the config that should work and I'll test it in labs first [13:36:23] I guess we'd set this up as an a/p dns-discovery service for e.g. debmonitor.discovery.wmnet, and that's what all the hosts would POST to, and then separately varnish would connected to debmonitor.svc.{eqiad,codfw}.wmnet for the user-facing part. [13:37:17] given that there will probably be just one host in eqiad and one in codfw the user-facing part could also be done putting directly the hostnames in teh varnish director I guess [13:37:36] but if the internal part requires an LVS endpoint ofc we can use that for the user-facing part too [13:38:22] it doesn't require LVS, unless there's more than one backend host for varnish to talk to per-DC, at which point LVS is our standard way of abstracting that (we don't put multiple backends from a single DC into the varnish definitions directly) [13:39:06] for some simpler cases where at some other layer of decision-making we've decided 1x host in each of the 2x DCs is sufficient redundancy+scaling, there's no point having an LVS service to do a 1:1 mapping. [13:39:23] indeed, one per-DC should be enough I don't think we'll need more [13:39:44] it does imply it's less-resilient than what we expect of other more production-y services, but I leave that thinking up to you :) [13:40:28] (e.g. we may take offline willfully, or lose ungracefully, one of the core DCs, and it might stay gone for a while. at that point you have no live redundancy except installing on some new/spare hardware if your 1 remaining host fails concurrently) [13:41:25] yes, but all the data will be in our redundant DBs and deploying a new VM with that role, if needed, shouldn't take long, in case of this double concurrent failure [13:41:31] for the more-productiony services this is an anti-pattern. we prefer to tell people that the installation in a single DC should scale to handle the global load and be fully-resilient on its own, and that DC-level redundancy is a separate matter on top of that. [13:42:07] as for the 'missing' data that might have been lost it will be reconcilated automatically by a daily job,so no concern on that side [13:42:15] but yeah, I'm ok with it. 
not arguing, just talking aloud :) [13:42:44] I'm ok also with 2 per-DC, no problem at all, just wondering if it's a bit too much for the kind of service [13:42:51] (if nothing else, to make sure the pattern being used here isn't perceived by some onlooker as a valid thing to cargo-cult for another service of a different kind!) [13:43:05] eheheh good point [13:43:31] we can also establish that discovery stuff must be at least 2xDC [13:44:14] yes, there's no point configuring the discovery-dns stuff if the service only exists in one place, if that's what you mean. [13:44:55] although I guess if it's a temporary initial state with plans to deploy at the other, we might standardize that it starts out using foo.discovery.wmnet manually-configured, so that adding proper dns-disc later doesn't require a change of endpoint hostname. [13:45:53] manually you mean by CNAME? [13:46:03] I still suspect in the very long run, when we reach a more-ideal state, we might find dns-disc to be more-or-less not-so-useful, but I haven't given sufficient time and attention to that line of thought to really vet it and lay it all out [13:46:33] (well, yes, or manually by just putting in the same IPs for foo.svc.eqiad.mwnet and foo.discovery.wmnet. CNAME is mostly-evil and probably best avoided except when really necessary) [13:47:40] back on that line before: still, I tend to think of it mostly as a useful hack for the present, I haven't convinced myself it's a correct part of the final solution. (dns-disc as we have it today) [13:49:07] yes I remember your long-term concern [13:50:12] so in this case I think I'll have to settle on doing foo.svc.eqiad.mwnet where 80 will be used by varnish and 443 by the internal clients, so I'll have to manage both 80 and 443 at the nginx level [13:50:16] is that correct? [13:51:44] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123145 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` and were **ALL** successful. [13:53:04] also because probably the internal 443 certificate will be a puppet-CA generated one, in order to use the puppet client certs when connecting internally [13:53:54] yeah [13:54:00] varnish doesn't do https anyways :) [13:54:04] yeah! [13:54:28] could restrict the port 80 to cache_misc if you want to avoid mis-use, there's a pattern floating around puppet that does that with ferm rules, for better or wose. [13:54:31] *worse [13:55:08] interesting, I'll have a look [13:55:28] thanks a lot, I've a much clear idea now, just need to translate this into puppet :-P [13:56:18] moritzm: the LVS servers recently upgraded to stretch (lvs5003 and lvs4007) are running with linux 4.9.82-1+deb9u3 (2018-03-02), while those on jessie run 4.9.82-1~wmf1 (2018-02-19) [13:56:32] I assume it's all bueno but wanted to double-check with you :) [13:57:16] * vgutierrez runs to YADA (yet another dentist appointment) [13:58:57] ema: yeah, that's all fine. the difference on stretch is just naming (~wmf1 compared to +deb9u2) and an irrelevant bugfix for powerpc64 (deb9u2->deb9u3) [14:07:33] bblack: if you have some more free time: about my new wdqs-internal service (https://gerrit.wikimedia.org/r/#/c/424587/ & https://gerrit.wikimedia.org/r/#/c/424599/ ) was there anything else to correct? [14:19:23] so, lvs4007 looks fine. 
All services are defined properly according to ipvsadm, IPs are in the right place, the interface-rps machinery looks good and so on [14:19:37] I'm gonna re-enable bgp [14:20:42] ok [14:21:13] I'd like to pause here for at least a day if you don't mind, and find some time in my day later to go deep-dive around these hosts manually and validate little details and such, and give them some runtime too. [14:21:21] (re stretch LVSes) [14:21:32] sure! [14:22:06] We've validated that basic LVSing works fine w/ lvs5003, but a deeper look is most welcome if you've got the time [14:36:58] yeah mostly I'm just wondering if all the fine details came over too, scalability hacks and tuning and etc etc [14:37:07] those won't cause immediately-obvious functional problems [14:37:42] with that stuff, I'm more worried that we've missed a non-obvious fallout of the interface renaming. the kernels were already about the same. [14:39:39] hopefully there's nothing like `if blah | grep eth` lurking in the darkness of our puppet code :) [14:42:42] bblack: something interesting happened on cp2022 after T191229 -> https://phabricator.wikimedia.org/P6979 [14:42:42] T191229: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229 [14:42:48] I know interface-rps has at least one place where it defaults to assuming ethX naming, but I think on LVS we use explicit arguments so the default doesn't get used. [14:43:25] ema: re 2022 that was during startup I believe, if it's the same one from before [14:43:40] notice the time change between the first and the second log line ​Apr 11 02:11:57 -> ​Apr 10 23:11:06 [14:43:41] basically the machine had been offline a long time and the motherboard was replaced, and system time was way off on bootup [14:43:52] systemd corrected it with a big timestep *after* other services had started [14:44:03] which crashes varnishd [14:44:21] it wasn't pooled yet at the time, and I restarted the outer whole varnish service that crashed to correct it and get the restart count back to 1 in icinga [14:44:50] mmh, I've noticed because of an icinga warning for cp2022's varnish-be child restarts [14:45:02] perhaps you've tackled the issue on another system? [14:45:05] oh maybe we're talking about different things then [14:45:24] it was 2022, but maybe when I looked only fe had crashed so far and I only restarted that, and -be noticed/crashed later? [14:45:48] I probably should've restarted both heh [14:46:51] yes varnish-fe crashed at 18:12:26 on Apr 10 [14:47:17] while the backend crashed at 23:11:06 [14:47:39] right, so I guess varnishd noticing the time-step and choosing to crash is a race-case, maybe one where the fe is more likely to find it or whatever. [14:47:54] just took longer for the be to notice it [14:48:03] (maybe for an idle thread to wake up or whatever) [14:48:43] I think papaul's about to take down a different 2x upload@codfw shortly [14:49:00] so maybe if the only issue right now is the icinga restart-count, leave it alone so we don't lose a 3rd cache's contents concurrently [14:49:33] there's currently no issue, I've restarted cp2022's backend 6 hours ago :) [14:49:44] ok! :) [14:50:17] 7 even! Time flies when you're having a blast [14:52:56] gehel: ok so the UDP monitor did not break anything \o/ [14:53:05] gehel: enabling it for logstash-{json,syslog}-udp too [14:53:12] ema: thanks! 
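(Editor's note: two quick checks for the renaming fallout ema and bblack worry about above; a sketch — repo paths assumed, and the udevadm call only makes sense on a host still using the old ethX names:)

```
# 1) hard-coded ethX assumptions hiding in the puppet tree
git grep -nE 'eth[0-9]' modules/interface modules/lvs
# 2) what udev will call a NIC after the reimage (ID_NET_NAME_ONBOARD wins over _PATH when set)
sudo udevadm test-builtin net_id /sys/class/net/eth0 2>/dev/null | grep ID_NET_NAME
```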
[14:54:30] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123393 (10Krinkle) [14:58:49] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123399 (10Vgutierrez) [15:15:28] ema: whenever you have time we could discuss to remove IPSEC from cp hosts to kafka1012->23 (since all the vk traffic has been migrated to jumbo) [15:18:10] yeah I mentioned that in the monday meeting. otto said something I think along the lines of their being other non-vk traffic in that path though, maybe something like statsd, etc? [15:18:21] I have no idea [15:18:31] but now I do wonder if there's other non-vk flows to worry about there [15:20:54] so all the vk traffic has been migrated to jumbo, with the exception of the statsv instance (powering performance data, IIUC not PII) that goes to kafka100[123] [15:21:19] nothing should (in theory) leave the cp hosts to end up in kafka1012->23 anymore [15:22:08] not sure about other non-vk-analytics flows [15:27:26] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4123551 (10debt) Hi @BBlack - can you add your concerns to this ticket....we're needing to get this figured out soon. Thanks! [15:31:22] 10Traffic, 10Operations, 10ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123569 (10Papaul) DIMM A2 replaced DIMM B2 replaced DIMM B6 replaced [15:43:20] elukey: I assume we could start by getting rid of non-jumbo nodes from cache::ipsec's config? We're not using those anymore right? https://gerrit.wikimedia.org/r/425550 [15:46:15] well no, I do not assume that. I imagine that. :) [15:46:15] 10Traffic, 10Operations, 10ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123634 (10RobH) system has been pushed back into service with the new memory in use [15:49:43] ema: that sounds about right, but probably also need to remove the ipsec-role stuff from those kafka nodes as well, so that they don't try to set up their side of the relationship and fail puppetization/icinga-checks as well. [15:50:30] oh, yes [15:57:57] CR updated although of course `not including ::role::ipsec` != `getting rid of all the crap role::ipsec scattered on the floor` [15:58:24] 10Traffic, 10Operations, 10ops-codfw: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123709 (10Papaul) DIMM B3 replaced BIOS update IDRAC update [16:03:34] 10Traffic, 10Operations, 10ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123741 (10RobH) also note I rebooted cp2008 into the post and debian kernel selection screen 7 times, without any memory post errors. [16:04:43] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4123756 (10Papaul) 05Open>03Resolved [16:04:57] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10Papaul) [16:05:00] 10Traffic, 10Operations, 10ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123757 (10Papaul) 05Open>03Resolved [16:07:41] ema: <3 - in standup now but I'll check the code review later on! 
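(Editor's note: before merging the ipsec removal it is easy to double-check on a cp host that nothing still talks to the old analytics brokers; a sketch, with config path and broker naming assumed:)

```
# any varnishkafka instance still configured against kafka1012-1023?
grep -rnE 'kafka10(1[2-9]|2[0-3])' /etc/varnishkafka/
# any live connections to port 9092 whose destination isn't the jumbo cluster?
sudo ss -tr state established '( dport = :9092 )'
```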
[16:17:37] 10Traffic, 10Operations, 10ops-codfw: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123838 (10Papaul) DIMM A2 replaced DIMM A6 replaced BIOS update IDRAC update [16:19:54] 10Traffic, 10Operations, 10ops-codfw: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123845 (10RobH) so we rebooted this system half a dozen times through post and kernel section splash screen and no more memory errors. [16:37:18] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123938 (10ayounsi) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing [16:38:02] 10Traffic, 10Operations, 10ops-codfw: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123940 (10RobH) rebooted this half a dozen times after the memory swap, and no memory errors have cropped back up. pushed back into service. @papaul: can you please post the return tag tracking numb... [17:25:14] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4124125 (10Papaul) [17:25:17] 10Traffic, 10Operations, 10ops-codfw: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4124122 (10Papaul) 05Open>03Resolved a:05Papaul>03None I do not have them Dell tech already took all the boxes [17:25:35] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10Papaul) [17:25:42] 10Traffic, 10Operations, 10ops-codfw: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4124126 (10Papaul) 05Open>03Resolved
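(Editor's note: as a follow-up on the DIMM swaps above, the SEL is the quickest place to confirm the uncorrectable-memory errors don't come back; a sketch, run on the host itself or remotely with -I lanplus against the mgmt interface:)

```
# list system event log entries and look for ECC / memory events
sudo ipmitool sel elist | grep -iE 'memory|ecc'
```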