[00:00:30] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182459 (10BBlack) >>! In T121135#1910435, @Atsirlin wrote: > @Legoktm: Frankly speaking, for a small project like Wikivo...
[00:44:53] <wikibugs>	 7Varnish, 6Performance-Team: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2182569 (10ori) A few months ago, we (very briefly) enabled reporting of backend latency via the stats interface in MediaWiki for all API request, and it overwhelmed the central statsd aggregator. Since...
[00:58:55] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182577 (10Jdlrobson) >>! In T121135#2182349, @Wrh2 wrote: > Cache is cleared fairly regularly even if articles aren't ed...
[01:18:54] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182584 (10Wrh2) If the template was at fault the behavior should be consistent - currently if a page is edited or flushe...
[01:29:05] <wikibugs>	 10Traffic, 6Operations: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182588 (10BBlack)
[01:29:17] <wikibugs>	 10Traffic, 6Operations: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182601 (10BBlack) p:5Triage>3Normal
[01:45:17] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182613 (10Jdlrobson) >>! In T121135#2182584, @Wrh2 wrote: > If the template was at fault the behavior should be consiste...
[01:51:32] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182614 (10Jdlrobson) My current theory is that under some circumstances the banner is generated before the table of cont...
[09:07:12] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2182880 (10Gehel) @RobH I'd really appreciate if you could let me do the reclaim / reinstall so that I learn something in the process (this...
[12:34:14] <ema>	 I'm thinking of how to refactor cluster_be_recv_applayer_backend in misc-backend.inc.vcl.erb
[12:34:20] <ema>	 it's pretty ugly at the moment :)
[12:34:52] <ema>	 perhaps we could map host headers to backend names in hieradata/common/cache/misc.yaml?
[12:35:06] <ema>	 something along these lines:
[12:35:08] <ema>	 host_backends:
[12:35:08] <ema>	     git.wikimedia.org:
[12:35:08] <ema>	         - 'antimony'
[12:35:08] <ema>	     doc.wikimedia.org:
[12:35:11] <ema>	         - 'gallium'
[12:35:13] <ema>	     integration.wikimedia.org:
[12:35:16] <ema>	         - 'gallium'
[12:35:18] <ema>	     download.wikimedia.org:
[12:35:21] <ema>	         - 'dataset1001'
[12:35:23] <ema>	     gerrit.wikimedia.org:
[12:35:26] <ema>	         - 'ytterbium'
[12:35:57] <ema>	 and then iterate over those definitions in the VCL template 
[12:36:43] <ema>	 at the end of the erb loop we would still have to hardcode the exceptions such as yarn.wikimedia.org and planet
[12:49:01] <bblack>	 there's a phab ticket for something like that, but a little more ambitious
[12:49:05] <bblack>	 I'm trying to find it now heh
[12:58:24] <bblack>	 hmmm I still can't find it, and I know it's there somewhere....
[12:59:09] <bblack>	 ah!
[12:59:10] <bblack>	 https://phabricator.wikimedia.org/T110717
[13:00:18] <bblack>	 ema: ^
[13:02:58] <bblack>	 digging for that reminded me that there are a ton of backlogged Traffic tickets that need looking at for basic triage/reject/commentary/whatever
[13:06:27] <moritzm>	 bblack: ema and I talked earlier about rebooting cp* systems into the new 4.4 kernel. before that happens I would install the systemd and glibc bugfix updates from the jessie 8.4 point release on the cp* hosts:
[13:06:32] <moritzm>	 https://packages.qa.debian.org/s/systemd/news/20160313T213434Z.html
[13:06:38] <moritzm>	 https://packages.qa.debian.org/g/glibc/news/20160229T073251Z.html
[13:06:52] <moritzm>	 (since these are only fully active when rebooted)
[13:09:46] <bblack>	 ok
[13:09:53] <bblack>	 are these already available as updates?
[13:09:58] <ema>	 yep
[13:10:42] <bblack>	 have done any caches as 4.4 canaries yet? I don't even remember
[13:10:59] <ema>	   cp1071.eqiad.wmnet
[13:10:59] <ema>	   cp1067.eqiad.wmnet
[13:10:59] <ema>	   cp1008.wikimedia.org
[13:10:59] <ema>	   cp4006.ulsfo.wmnet
[13:10:59] <ema>	   cp3048.esams.wmnet
[13:11:44] <bblack>	 ok
[13:12:45] <bblack>	 so we need: apt-get upgrade && install kernel-4.4 on all the cp*
[13:13:02] <bblack>	 and then after that's done, we can start on a cycle of depooled reboots
[13:13:12] <bblack>	 keeping in mind that some of them are going to fail to reboot :/
[13:13:22] <ema>	 I'm not sure whether moritzm wanted to upgrade all packages or only a selected few actually
[13:13:40] <bblack>	 well, we need the upgrades regardless, they're stacking up
[13:13:59] <bblack>	 in general we tend to trust upstream pushes fixes for a good reason
[13:14:10] <bblack>	 The following packages will be upgraded: base-files bind9-host dnsutils firmware-bnx2x host initramfs-tools libbind9-90 libc-bin libc-dev-bin libc6 libc6-dev libcairo2 libdns-export100 libdns100 libglib2.0-0 libgraphite2-3
[13:14:14] <bblack>	  libgtk2.0-0 libgtk2.0-bin libgtk2.0-common lbirs-export91 libisc-export95 libisc95 libisccc90 libisccfg-export90 libisccfg90 liblwres90 libpam-modules libpam-modules-bin libpam0pg libsystemd0 libudev1 linux-image-3.16.0-4-amd64 linux-libc-dev locales multiarch-support ruby systemd systemd-sysv udev
[13:14:46] <bblack>	 looks like mostly BIND-related, systemd, glibc/pam, and ruby
[13:15:59] <bblack>	 is there any real objection to updating everything?
[13:16:32] <moritzm>	 no, it's fine
[13:16:44] <ema>	 moritzm: including initramfs-tools?
[13:17:06] <bblack>	 https://phabricator.wikimedia.org/T126062 is going to make the reboots painful in esams
[13:17:19] <moritzm>	 for fleet-wide upgrades I upgrade individual source packages, but invidual upgrades when bundled with reboots are also fine
[13:17:20] <bblack>	 still no real investigation or resolution on that, although maybe if we're lucky 4.4 fixes it
[13:18:00] <moritzm>	 ema: if it doesn't break with the first host, it should be no problem
[13:18:05] <ema>	 right
[13:18:30] <bblack>	 if 4.4 doesn't fix it, they all need help at the grub prompt, to add modprobe_blacklist=ipmi_si just to boot.  and the ones already booted that way might be incapable of commandline reboot this time regardless, needing racadm-based reboot at the end.
[13:18:32] * ema stares at gtk packages on cp* hosts and sheds a tear
[13:18:35] <moritzm>	 I don't expect trouble, but it's the only risky change due to our kernel discepancy
[13:18:46] <moritzm>	 I think I filed a but for that
[13:18:50] <moritzm>	 bug
[13:18:53] <moritzm>	 let me find it
[13:18:54] <bblack>	 what brings them in?
[13:19:17] <moritzm>	 https://phabricator.wikimedia.org/T127054
[13:19:33] <bblack>	 oops typo: modprobe.blacklist=ipmi_si
[13:20:18] <moritzm>	 this can probably be fixed in the late_install script in d-i, but haven't tested that yet
[13:21:43] <bblack>	 relatedly, authdns and lvs boxes aren't up to date on packages either, and authdns is still back on 3.19 too
[13:22:34] <moritzm>	 several hosts with reboot problems in codfw have been fixed by Papaul with an idrac update
[13:22:44] <moritzm>	 In ran into those with the last kernel updates
[13:22:53] <ema>	 is it pinentry-gtk2 bringing in the gtk dependencies?
[13:23:51] <moritzm>	 https://phabricator.wikimedia.org/T130008
[13:23:59] <moritzm>	 ema: yes
[13:24:32] <moritzm>	 I'm not sure which exact idrac version Papaul installed, though
[13:29:54] <bblack>	 ema: want to try a test host and make sure about initramfs-tools? maybe a text or upload node in eqiad?
[13:30:08] <bblack>	 (with all packages upgraded + 4.4 install + reboot)
[13:30:37] <ema>	 sounds good!
[13:31:00] <wikibugs>	 10Traffic, 10MediaWiki-Interface, 6Operations: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2183332 (10BBlack) 5Open>3Resolved I'm guessing by now they're all naturally expiring out anyways since there's no further feedback.
[13:31:58] <ema>	 let's say cp1052 (text)
[13:32:26] <ema>	 moritzm: apt-get upgrade first and then apt-get install linux-image-4.4.0-1-amd64?
[13:32:50] <ema>	 this way we can see how well the new initramfs-tools and our kernel play together
[13:33:12] <wikibugs>	 10Traffic, 6Operations, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183337 (10BBlack) p:5Low>3Normal We didn't end up keeping SPDY disabled, and HTTP/2 is coming.  From our end, this is a relatively simple change now, but t...
[13:33:39] <bblack>	 +1
[13:35:57] <moritzm>	 ema: either that or simply install linux-meta (which will pull in the latest package)
[13:35:57] <wikibugs>	 10Traffic, 6Operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183342 (10BBlack) We actually dug further into related issues when investigating WDQS woes on cache_misc, and the problem is different than what we thought we understood i...
[13:36:20] <wikibugs>	 10Traffic, 6Operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183343 (10BBlack)
[13:36:22] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2183344 (10BBlack)
[13:36:31] <ema>	 cool
[13:36:49] <ema>	 alright, depooling cp1052 then
[13:38:59] <wikibugs>	 10Traffic, 6Operations, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183349 (10BBlack)
[13:39:02] <wikibugs>	 10Traffic, 6Operations, 6Performance-Team, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2183350 (10BBlack)
[13:39:18] <ema>	 moritzm: is there a phab tracking kernel upgrades? I could only find https://phabricator.wikimedia.org/T126320
[13:41:01] <wikibugs>	 7HTTPS, 10Traffic, 10Citoid, 6Operations, and 3 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#2183351 (10BBlack) 5Open>3Resolved a:3BBlack When we moved various *oid to the text cluster as part of the parsoidcache decom, they got forced to HTTPS-o...
[13:42:05] <wikibugs>	 10Traffic, 6Operations: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#2183359 (10BBlack) 5Open>3declined We've made decisions about this in the past already and moved past this idea.  The general direction is to always use LVS for multi-host varnish backends, and solve HTTPS iss...
[13:42:23] <moritzm>	 ema: not yet, let me create one
[13:42:30] <ema>	 thanks
[13:43:33] <moritzm>	 https://phabricator.wikimedia.org/T131928
[13:44:47] <bblack>	 I'm going to try upgrading baham and see how that flies (authdns latest packages + 4.4), once cp1052 comes back ok
[13:47:38] <ema>	 apt-get upgrade done on cp1052
[13:47:55] <ema>	 linux-meta recommends firmware-linux-free, do we need it?
[13:48:18] <ema>	 moritzm, bblack: ^
[13:48:35] <bblack>	 not sure.  I haven't installed any recommended-only packages in the past
[13:48:47] <bblack>	 let's see what happened on LVS
[13:49:12] <ema>	 it's not installed on cp1071 (running 4.4) so I assume we don't need it
[13:49:36] <bblack>	 root@lvs1010:~# dpkg-query -s firmware-linux-free
[13:49:37] <bblack>	 dpkg-query: package 'firmware-linux-free' is not installed and no information is available
[13:49:40] <bblack>	 yeah same on lvs1010
[13:50:09] <ema>	 alright! 4.4 installed, ready to reboot
[13:52:23] <ema>	 reboot in progress
[13:52:52] <bblack>	 so when we get around to the part about rebooting every cache...
[13:53:27] <bblack>	 what I've done in the past is generated a hostname list, and then randomly shuffle the file around so we're a little less likely to hit the same cluster+dc for multiple hosts in a row
[13:55:06] <bblack>	 and then do a while loop over that list that goes through them one at a time on neodymium, using confctl to depool it, salt command to downtime the host in icinga for 15 minutes, then salt it a reboot command (using "at" to defer it, so the salt command returns immediately and correctly), then sleep before doing the next
[13:55:10] <ema>	 cp1052 is back up
[13:55:17] <bblack>	 and set the sleep time to give enough spacing for recovery and so-on
[13:55:33] <bblack>	 where that falls apart is when more than just one or two nodes fail to reboot cleanly on their own, as might be the case for esams still...
[13:57:53] <ema>	 uh nginx didn't start properly on cp1052
[13:58:11] <bblack>	 did puppet finish running before you checked?
[13:58:27] <bblack>	 yup
[13:58:34] <ema>	 Apr 06 13:55:48 cp1052 nginx[3492]: nginx: [emerg] chown("/var/lib/nginx/body", 33) failed (1: Operation not permitted)
[13:58:44] <bblack>	 yeah
[13:58:49] <bblack>	 that's a race, nice
[13:59:23] <bblack>	 hmmmmm
[13:59:33] <bblack>	 it has to do with out /var/lib/nginx mount
[13:59:52] <bblack>	 and the new systemd sec unit file
[14:00:07] <bblack>	 let me poke at it a bit
[14:00:12] <ema>	 go ahead
[14:01:41] <bblack>	 ok fixed temporarily there
[14:01:50] <bblack>	 we need CAP_CHOWN in our sysd sec file for nginx
[14:02:15] <bblack>	 it wasn't apparent in earlier testing because we only restarted the service.  rebooting the host wipes out the contents of the /var/lib/nginx tmpfs mount, resetting already-set perms
[14:02:29] <bblack>	 uploading a patch...
[14:02:49] <ema>	 nice
[14:07:21] <bblack>	 I'm salting puppet on all the caches now to pick up the nginx+sysd fixup
[14:08:20] <bblack>	 also, about to try baham
[14:08:28] <ema>	 varnishkafka-webrequest also had something to complain about
[14:08:41] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183411 (10RobH) >>! In T131880#2182880, @Gehel wrote: > @RobH I'd really appreciate if you could let me do the reclaim / reinstall so that...
[14:08:42] <ema>	 Apr 06 13:56:27 cp1052 varnishkafka[1346]: KAFKAERR: Kafka error (-195): kafka1014.eqiad.wmnet:9092/14: Failed to connect to broker at [kafka1014.eqiad.wmnet]:9092: Network is unreachable
[14:09:13] <ema>	 however it was running properly from systemd's point of view
[14:09:23] <bblack>	 well was that a temporary error?
[14:09:51] <ema>	 not sure, the last log was an error. I'd expect a reassuring message afterwards in case of temporary issues?
[14:11:20] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183429 (10BBlack) There's no real need to reinstall them.  I have patches pending to put them into their proper roles, etc.
[14:12:13] <bblack>	 I wouldn't expect anythign sane out of vk
[14:12:21] <bblack>	 does it show a good conn now in lsof?
[14:12:42] <bblack>	 if that was a fatal error with no retry, you'd think the daemon would just exit
[14:13:13] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183432 (10BBlack) The patch series starts at: https://gerrit.wikimedia.org/r/#/c/268236/ , but needs manual rebases at this point.
[14:13:21] <wikibugs>	 10Traffic, 10DNS: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2183433 (10Nemo_bis)
[14:13:48] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183446 (10BBlack) (it's better to look at T109162, that had all the patch links)
[14:13:57] <ema>	 yeah connections look good
[14:15:28] <bblack>	 ok starting on baham...
[14:15:38] <ema>	 repooling cp1052 then
[14:19:47] <bblack>	 heh, ferm on authdns (and probably others) has bad startup dependency loops
[14:22:41] <bblack>	 well, there are other issues too, I don't think any of it's from the new packages/kernel, we've seen it before
[14:23:32] <moritzm>	 did you run into https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796611 ?
[14:23:48] <moritzm>	 I'm considering to backport the systemd unit to our jessie package
[14:24:54] <bblack>	 I don't know...
[14:25:06] <bblack>	 no, I don't think it's that, but not 100% sure
[14:25:47] <bblack>	 the apparent root-most problem I see on baham (and this was true last time I rebooted any of our authdns boxes IIRC) is that on startup, it thinks network is up and starts all kinds of services, but the network isn't working
[14:26:10] <bblack>	 I can ping some nearby hosts, e.g. from baham (208.80.153.13), I can ping -n on 153.14
[14:26:20] <bblack>	 but the ipv4 default route is to 153.1, and I can't ping that
[14:26:36] <bblack>	 all the other services start anyways, can't resolve hostnames because they can't reach our DNS recursors, etc...
[14:26:56] <bblack>	 eventually several minutes later, pings to the def gw start working, and now I have to go restart a bunch of borked services
[14:27:25] <bblack>	 ferm in particular is a victim because it needs to do DNS lookups to resolve hostnames for rules
[14:27:47] <bblack>	 but I don't know if that's also separately a chicken and egg thing, if we're trying to start ferm before the interfaces are online or something
[14:27:52] <bblack>	 (or somehow indirectly related)
[14:28:08] <moritzm>	 right, I also ran into this on radon with an earlier reboot
[14:28:22] <moritzm>	 it just magically starts to work after a minute of digging :-)
[14:39:40] <ema>	 the fact that no iptables rules were defined on baham depends on the fact that ferm didn't do its thing on boot I guess
[14:40:09] <bblack>	 yeah
[14:40:11] <bblack>	 I started ferm now
[14:40:22] <bblack>	 ferm couldn't resolve hostnames it needs for rules
[14:40:33] <bblack>	 so things started "working" for sure around :25, maybe a little earlier
[14:40:36] <bblack>	 syslog around there is:
[14:40:38] <bblack>	 Apr  6 14:24:37 baham systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1785 (check_disk)
[14:40:41] <bblack>	 Apr  6 14:24:37 baham systemd[1]: Mounting Arbitrary Executable File Formats File System...
[14:40:44] <bblack>	 Apr  6 14:24:37 baham systemd[1]: Mounted Arbitrary Executable File Formats File System.
[14:40:47] <bblack>	 Apr  6 14:24:37 baham rsyslogd0: action 'action 1' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/0 ]
[14:40:50] <bblack>	 Apr  6 14:24:37 baham rsyslogd-2359: action 'action 1' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/2359 ]
[14:40:53] <bblack>	 Apr  6 14:25:01 baham CRON[1806]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
[14:40:56] <bblack>	 Apr  6 14:25:01 baham ntpd_intres[1441]: DNS chromium.wikimedia.org -> 2620:0:861:2:208:80:154:157
[14:40:59] <bblack>	 so at 25:01, DNS is working again because routing is working again
[14:41:09] <bblack>	 does that automount for binfmt_misc have something to do with all of this indirectly?
[14:41:47] <bblack>	 the 25:01 timestamp could be after some retry-delay for ntpd_initres
[14:41:56] <bblack>	 I wonder if rsyslogd triggered that mount, or responded to it?
[14:44:19] <ema>	 bblack: uh the varnishkafa-webrequest problem on cp1052 is likely related to what we're seeing on baham
[14:44:41] <ema>	 as in: systemd thinks the network is up, starts varnishkafka, [kafka1022.eqiad.wmnet]:9092: Network is unreach
[14:44:47] <ema>	 *able
[14:45:22] <bblack>	 yeah
[14:45:31] <ema>	 in that case however the automount for binfmt_misc happens at 13:56:07
[14:45:44] <ema>	 and at 13:56:27 the network is still unreachable
[14:45:47] <bblack>	 it just doesn't take so long to fix itself on caches for whatever reason
[14:46:38] <bblack>	 lldpd config is off on baham too, and it complains on startup.  surely not related?
[14:47:22] <wikibugs>	 10Traffic, 6Labs, 6Operations, 10Tool-Labs, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10zhuyifei1999)
[14:50:52] <bblack>	 lldpd config seems to be borked lots of places, hmmmm
[14:51:02] <bblack>	 anyways, I have an interview to do on the hour
[14:51:44] <bblack>	 restarted ntp on baham, that was the only thing red in icinga, should recover itself eventually now
[14:51:52] <bblack>	 back in an hour or so
[14:52:02] <ema>	 see you later
[14:53:16] <wikibugs>	 10Traffic, 6Labs, 6Operations, 10Tool-Labs, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10valhallasw) Does Wikipedia Zero include non-wikipedia domains? I would expect tools.wmflabs.org to fall out of scope.
[14:53:37] <bblack>	 before we start on the reboots, might ping papaul about whether we can query iDRAC rev and/or fix it on the way
[14:53:40] <bblack>	 for the esams issues
[14:53:57] <bblack>	 but IM HO we can go ahead with the package updates + kernel installs on them all for now and sort out the reboots after
[14:54:42] <ema>	 alright then I'll go ahead with the updates on all cp* hosts
[15:00:59] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183560 (10RobH) @gehel: Since this isn't going to end up being a reinstall, I'll ping you to do a reinstall on one of the many I do every w...
[15:14:49] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183605 (10mmodell) I'm sure we could hack the Jenkins job to use https but the staging...
[15:20:48] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183636 (10mmodell) Why is labs blocked from connecting to ssh? Is that to avoid people...
[15:21:32] <ema>	 all cp* hosts upgraded to jessie 8.4 and kernel 4.4
[15:22:21] <bblack>	 yay
[15:22:37] <bblack>	 I'll probably wait a couple days on the other 2x authdns, since this is the first authdns+4.4 on baham
[15:23:29] <bblack>	 (plus, I'd like to have better plans or ideas before the next authdns reboot about wtf is going on with network...)
[15:27:00] <wikibugs>	 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183653 (10RobH) a:5RobH>3None Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannto seem to...
[15:29:02] <ema>	 right
[15:29:07] <ema>	 bblack: on cp1008 I've installed the upgrades with apt-get install instead of upgrade to avoid downgrading nginx
[15:29:18] <bblack>	 ok, awesome
[15:57:39] <bblack>	 oh look, another meeting :P
[15:57:54] <ema>	 lol
[15:58:57] <bblack>	 so yeah on the misc-web stuff, I guess there's a bunch of horrible repetitive varnish4 conditionals if we leave the VCL stuff as-is?
[15:59:05] <ema>	 correct
[16:00:02] <bblack>	 ok I'll poke around a bit later today, see if there's an "easy" way to get halfway to https://phabricator.wikimedia.org/T110717 from where we are now
[16:00:48] <bblack>	 maybe start adding new keys to the existing $app_directors for misc or something, which can define the incoming req.http.host matches and pass/TLS behavior, et.c..
[16:01:00] <ema>	 nice. Strictly for misc, a data structure with host_eq or path_re mapping to a backend name would be enough
[16:01:16] <ema>	 but that's certainly not as ambitious as T110717 :)
[16:02:19] <bblack>	 in the general case the stuff in $app_directors is a good candidate for hiera, but I think I avoided that for now because there's still so much refactor going on...
[16:03:35] <ema>	 perhaps we could start by adding host_eq|path_re to app_directors for now and then think of a good way to refactor that into yaml?
[16:05:10] <bblack>	 I think the main holdup on hieradata yaml even for the existing app_directors is the be_opts stuff
[16:05:20] <bblack>	 'be_opts'  => merge($app_def_be_opts, { 'port' => 8080 }),
[16:05:21] <bblack>	 and such
[16:06:06] <bblack>	 obviously, we can 'fix' that by copying the defaults to every stanza...
[16:06:27] <ema>	 not that pretty though
[16:07:11] <bblack>	 before we had the layers of defaulting, and the "merge" happened down in the VCL templates, I just undid that in some earlier refactors which left us with the merge() up in puppet
[16:33:06] <paravoid>	 DegradedArray event on /dev/md/0:cp1052
[16:33:08] <paravoid>	 known?
[16:33:16] <bblack>	 it was recently-rebooted
[16:33:31] <bblack>	 probably the systemd thing again...
[16:33:39] <ema>	 no icinga alert though
[16:34:08] <bblack>	 I think we've seen this before, with a race between starting md arrays and the detection of sdX disks coming online
[16:35:45] <bblack>	 don't ask me what the real real cause is
[16:35:54] <ema>	 yup
[16:36:05] <ema>	 Apr 06 13:54:45 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors
[16:36:08] <ema>	 Apr 06 13:54:45 cp1052 kernel: md0: detected capacity change from 0 to 9990832128
[16:36:11] <ema>	 Apr 06 13:54:45 cp1052 kernel: ata2.00: Enabling discard_zeroes_data
[16:36:14] <ema>	 Apr 06 13:54:45 cp1052 kernel: sd 1:0:0:0: [sdb] Attached SCSI removable disk
[16:36:22] <bblack>	 yeah
[16:36:34] <bblack>	 in syslog it has precise timestamps too, the race is tiny
[16:37:24] <bblack>	 I wonder what's up with that, since it seems to be at the kernel level before systemd gets involved
[16:37:34] <bblack>	 a flag in the md config on disk to wait for all disks or something?
[16:40:37] <bblack>	 anwways, fixed for now with: mdadm --manage /dev/md0 --add /dev/sdb1
[16:40:43] <bblack>	 and now it's rebuilding the mirror
[16:40:51] <ema>	 alright
[16:41:50] <bblack>	 I'm sure it's ultimately systemd related somehow.  or at least udev related which is systemd related
[16:42:07] <bblack>	 it smells of systemd.  always racing to start everything as fast as it can in parallel, nevermind all the b0rkage
[16:44:37] <bblack>	 systemd needs a config flag for "I don't give a fuck about booting 30 seconds slower, please just boot everything correctly" :P
[16:44:47] <ema>	 :)
[16:44:53] <ema>	 but that's the kernel right
[16:45:15] <bblack>	 well it's the kernel running stuff from initramfs for udev rules and udev and systemd are in bed with each other
[16:45:18] <bblack>	 I consdier it all related
[16:45:59] <ema>	 oh in that sense
[16:47:49] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, and 2 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2184019 (10Jdlrobson)
[16:50:17] <ema>	 I remember having these issues before systemd and rootdelay=X was the workaround
[16:51:43] <ema>	 as in not exactly like this but weird race conditions while booting
[16:59:33] <bblack>	 hmmm
[17:00:13] <bblack>	 in the syslog it seems like the basic order of events is "sda detected -> part of md0 -> start md0 (without sdb, because it's not here yet) -> sdb detected"
[17:00:31] <ema>	 yep
[17:00:32] <bblack>	 so it seems like there should be some kind of udev+md solution for "wait a bit for all your disks to show up"...
[17:01:22] <bblack>	 I mean at first glance you'd say "have md not start until all disks show up", but then if a disk really is missing you don't even get a degraded start
[17:01:32] <bblack>	 so something, something has to have a timeout where it gives up and just uses the disks it has
[17:03:08] <bblack>	 in the general case there are probably edge cases like md devices which have one or more disks on pluggable storage the user might not plug in for hours
[17:09:19] <bblack>	 one thread on the internet suggests renaming them in mdadm.conf from /dev/md/0 to /dev/md0 : https://bugs.archlinux.org/task/33851#comment106076
[17:09:36] <bblack>	 something about some initramfs thing that only knows about the /dev/mdX style names
[17:09:59] <bblack>	 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=633024
[17:11:06] <ema>	 yeah I was reading the debian bug as well
[17:11:15] <ema>	 As the documentation explains, rootdelay delays between SCSI scan and
[17:11:17] <ema>	 mdadm / LVM assembly / scan
[17:11:38] <bblack>	 yeah
[17:11:44] <bblack>	 I mean, it's a viable workaround, I just hate it
[17:11:53] <bblack>	 sticking sleeps in places instead of actually fixing the real dependency issue
[17:11:59] <ema>	 not cool
[17:15:13] <bblack>	 one of the installserver changes we have from trusty to jessie is:
[17:15:14] <bblack>	 -d-idebian-installer/add-kernel-optsstringelevator=deadline rootdelay=90
[17:19:33] <ema>	 there's a /lib/systemd/system/mdadm-last-resort@.timer
[17:19:41] <ema>	 Timer to wait for more drives before activating degraded array.
[17:19:57] <bblack>	 hmmm
[17:20:09] <ema>	 as well as /lib/systemd/system/mdadm-last-resort
[17:20:16] <ema>	 sorry, /lib/systemd/system/mdadm-last-resort@.service
[17:20:25] <ema>	 Activate md array even though degraded
[17:20:51] <bblack>	 but surely this isn't for the rootfs
[17:22:42] <ema>	 bblack: have you seen a race condition like this with 3.x kernels?
[17:24:23] <bblack>	 I think so, but I'm not 100% sure
[17:31:56] <moritzm>	 FWIW, for the recent mass reboots of 3.13 and 3.19 I only ran into this with authdns/radon
[17:32:50] <ema>	 another workaround, slightly better than rootdelay, could be scsi_mod.scan=sync
[17:52:20] <wikibugs>	 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184248 (10faidon) The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to t...
[18:01:11] <bblack>	 today is one of those days where I have too many open windows / patches / investigations.  I'm just getting bogged down in the multitask churn.
[18:02:30] <bblack>	 we should chase down all these bootup errors before we reboot all the things for 4.4 though
[18:02:38] <bblack>	 save a lot of pain this time around and down the road
[18:03:19] <ema>	 yeah
[18:03:21] <bblack>	 I'd start with seeing if scsi_mod.scan=sync and/or rootdelay=10 or something fixes this one.  if it works as a one-off, find the right way to puppetize it and play nice with grub customization for debian
[18:03:38] <bblack>	 depool a server and reboot it like 10 times and see if the fix sticks?
[18:03:46] <ema>	 perhaps cp1008?
[18:04:08] <ema>	 if we manage to reproduce the issue there, of course
[18:04:55] <ema>	 I'll create a phab task for both boot issues in the meantime
[18:05:29] <bblack>	 ok thanks
[18:16:39] <wikibugs>	 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184309 (10RobH) Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.)  The system was also showing in the config in the def...
[18:20:38] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Andrew) > Why is labs intentionally blocked from connecting to ssh?  Can you...
[18:29:45] <ema>	 mmh
[18:29:47] <ema>	 Apr 06 13:54:46.849944 cp1052 lldpcli[919]: unknown command from argument 1: `#`
[18:30:19] <wikibugs>	 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2184350 (10BBlack) So, the gerrit change is held up on comments about `mx ?all` vs `mx -all`.  Are we confident phab emails only come from our mxes? ping @chasemp and...
[18:30:32] <bblack>	 ema: yeah I saw that earlier
[18:30:49] <bblack>	 but apparently lldpd isn't doing much pretty much everywhere.  I thought it was at one point in the past...
[18:30:57] <bblack>	 something's gone off the rails with our puppetization/config of it
[18:31:28] <bblack>	 I was wondering if that might somehow indirectly related to our "can't ping the default gateway" thing...
[18:33:24] <ema>	 bblack: has that happened in the past or is the 4.4/point release upgrade of today the likely cause?
[18:33:36] <bblack>	 it's happened before 4.4
[18:33:44] <ema>	 ah there you go
[18:33:46] <bblack>	 on the caches rebooting 3.19 -> 3.19
[18:34:02] <bblack>	 (the general can't ping default gateway thing, I mean)
[18:34:09] <bblack>	 sorry s/caches/authdns servers/
[18:34:29] <bblack>	 on the authdns servers that bad network state persists for minutes anyways.  maybe it happens elsewhere for shorter windows of time.
[18:34:50] <ema>	 OK, because interestingly enough on cp1052 we also saw "network issues" with varnishkafka 
[18:35:07] <ema>	 but on the otherhand ipsec seems to have worked fine
[18:35:22] <bblack>	 yeah
[18:35:37] <bblack>	 could be separate, something to do with ipv6 for kafka maybe, I saw something funky relatedly once
[18:36:07] <bblack>	 in other news, the new lvs salt grains work.  you can do things like: salt --out=raw -v -t 5 -b 1000 -G lvs:primary cmd.run '....'
[18:36:14] <bblack>	 ditto -G lvs:secondary
[18:36:26] <bblack>	 since so many things, we do them on the secondaries then primaries or whatever to be careful
[18:36:39] <bblack>	 I got tired of having to list out the hostnames among the set of -G cluster:lvs
[18:38:43] <ema>	 oh nice
[19:35:25] <wikibugs>	 7HTTPS, 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#1830027 (10BBlack) Note there are probably question-marks around these about insecure requests.  We don't yet block/deny insecure POST traffic ( T105794 ), but we'...
[19:44:55] <wikibugs>	 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184631 (10BBlack) So, we've had the API warning up for a couple of months now.  In general, we've continually fallen behind on promises to notify -> kill insecure...
[19:51:24] <wikibugs>	 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184639 (10konklone) @BBlack If you want someone to remind you about it, I am happy to volunteer. ;)
[20:41:41] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2184719 (10chasemp) 22 to only 208.80.154.250/32 as the service address for git-ssh shou...