[00:00:30] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182459 (10BBlack) >>! In T121135#1910435, @Atsirlin wrote: > @Legoktm: Frankly speaking, for a small project like Wikivo... [00:44:53] 7Varnish, 6Performance-Team: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2182569 (10ori) A few months ago, we (very briefly) enabled reporting of backend latency via the stats interface in MediaWiki for all API request, and it overwhelmed the central statsd aggregator. Since... [00:58:55] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182577 (10Jdlrobson) >>! In T121135#2182349, @Wrh2 wrote: > Cache is cleared fairly regularly even if articles aren't ed... [01:18:54] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182584 (10Wrh2) If the template was at fault the behavior should be consistent - currently if a page is edited or flushe... [01:29:05] 10Traffic, 6Operations: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182588 (10BBlack) [01:29:17] 10Traffic, 6Operations: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182601 (10BBlack) p:5Triage>3Normal [01:45:17] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182613 (10Jdlrobson) >>! In T121135#2182584, @Wrh2 wrote: > If the template was at fault the behavior should be consiste... 
[01:51:32] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182614 (10Jdlrobson) My current theory is that under some circumstances the banner is generated before the table of cont... [09:07:12] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2182880 (10Gehel) @RobH I'd really appreciate if you could let me do the reclaim / reinstall so that I learn something in the process (this... [12:34:14] I'm thinking of how to refactor cluster_be_recv_applayer_backend in misc-backend.inc.vcl.erb [12:34:20] it's pretty ugly at the moment :) [12:34:52] perhaps we could map host headers to backend names in hieradata/common/cache/misc.yaml? [12:35:06] something along these lines: [12:35:08] host_backends: [12:35:08] git.wikimedia.org: [12:35:08] - 'antimony' [12:35:08] doc.wikimedia.org: [12:35:11] - 'gallium' [12:35:13] integration.wikimedia.org: [12:35:16] - 'gallium' [12:35:18] download.wikimedia.org: [12:35:21] - 'dataset1001' [12:35:23] gerrit.wikimedia.org: [12:35:26] - 'ytterbium' [12:35:57] and then iterate over those definitions in the VCL template [12:36:43] at the end of the erb loop we would still have to hardcode the exceptions such as yarn.wikimedia.org and planet [12:49:01] there's a phab ticket for something like that, but a little more ambitious [12:49:05] I'm trying to find it now heh [12:58:24] hmmm I still can't find it, and I know it's there somewhere.... [12:59:09] ah! [12:59:10] https://phabricator.wikimedia.org/T110717 [13:00:18] ema: ^ [13:02:58] digging for that reminded me that there are a ton of backlogged Traffic tickets that need looking at for basic triage/reject/commentary/whatever [13:06:27] bblack: ema and I talked earlier about rebooting cp* systems into the new 4.4 kernel. 
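The host_backends mapping idea sketched above (12:34–12:36) is essentially "iterate a host → backend map and emit one VCL Host match per entry". A rough stand-in for that erb loop, with the example hosts from the chat and an illustrative (not authoritative) VCL output shape:

```shell
#!/bin/sh
# Sketch only: a host -> backend map standing in for the proposed
# hieradata/common/cache/misc.yaml entries, expanded into one Host match per
# entry the way the erb loop in misc-backend.inc.vcl.erb might. The emitted
# VCL syntax here is illustrative, not the cluster's real template output.
map='git.wikimedia.org antimony
doc.wikimedia.org gallium
integration.wikimedia.org gallium
gerrit.wikimedia.org ytterbium'

vcl=$(echo "$map" | while read -r host backend; do
    printf 'if (req.http.Host == "%s") { set req.backend = %s; }\n' "$host" "$backend"
done)
echo "$vcl"
```

Exceptions like yarn.wikimedia.org and planet would still be hardcoded after the generated block, as noted above.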
before that happens I would install the systemd and glibc bugfix updates from the jessie 8.4 point release on the cp* hosts: [13:06:32] https://packages.qa.debian.org/s/systemd/news/20160313T213434Z.html [13:06:38] https://packages.qa.debian.org/g/glibc/news/20160229T073251Z.html [13:06:52] (since these are only fully active when rebooted) [13:09:46] ok [13:09:53] are these already available as updates? [13:09:58] yep [13:10:42] have we done any caches as 4.4 canaries yet? I don't even remember [13:10:59] cp1071.eqiad.wmnet [13:10:59] cp1067.eqiad.wmnet [13:10:59] cp1008.wikimedia.org [13:10:59] cp4006.ulsfo.wmnet [13:10:59] cp3048.esams.wmnet [13:11:44] ok [13:12:45] so we need: apt-get upgrade && install kernel-4.4 on all the cp* [13:13:02] and then after that's done, we can start on a cycle of depooled reboots [13:13:12] keeping in mind that some of them are going to fail to reboot :/ [13:13:22] I'm not sure whether moritzm wanted to upgrade all packages or only a selected few actually [13:13:40] well, we need the upgrades regardless, they're stacking up [13:13:59] in general we tend to trust that upstream pushes fixes for a good reason [13:14:10] The following packages will be upgraded: base-files bind9-host dnsutils firmware-bnx2x host initramfs-tools libbind9-90 libc-bin libc-dev-bin libc6 libc6-dev libcairo2 libdns-export100 libdns100 libglib2.0-0 libgraphite2-3 [13:14:14] libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libirs-export91 libisc-export95 libisc95 libisccc90 libisccfg-export90 libisccfg90 liblwres90 libpam-modules libpam-modules-bin libpam0g libsystemd0 libudev1 linux-image-3.16.0-4-amd64 linux-libc-dev locales multiarch-support ruby systemd systemd-sysv udev [13:14:46] looks like mostly BIND-related, systemd, glibc/pam, and ruby [13:15:59] is there any real objection to updating everything? [13:16:32] no, it's fine [13:16:44] moritzm: including initramfs-tools?
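The "apt-get upgrade && install kernel-4.4" shorthand above might expand per-host to something like the following sketch (the 4.4 package name is the one used later on cp1052 and could differ; the dry-run wrapper is just so the commands can be printed instead of executed):

```shell
#!/bin/sh
# Per-host sketch of the upgrade step discussed above. With DRY_RUN=1 (the
# default here) the commands are only printed; nothing system-level runs.
set -e
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}
run apt-get -y upgrade
run apt-get -y install linux-image-4.4.0-1-amd64
# then a depooled reboot, so the new kernel, systemd and glibc become active
```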
[13:17:06] https://phabricator.wikimedia.org/T126062 is going to make the reboots painful in esams [13:17:19] for fleet-wide upgrades I upgrade individual source packages, but individual upgrades when bundled with reboots are also fine [13:17:20] still no real investigation or resolution on that, although maybe if we're lucky 4.4 fixes it [13:18:00] ema: if it doesn't break with the first host, it should be no problem [13:18:05] right [13:18:30] if 4.4 doesn't fix it, they all need help at the grub prompt, to add modprobe_blacklist=ipmi_si just to boot. and the ones already booted that way might be incapable of commandline reboot this time regardless, needing racadm-based reboot at the end. [13:18:32] * ema stares at gtk packages on cp* hosts and sheds a tear [13:18:35] I don't expect trouble, but it's the only risky change due to our kernel discrepancy [13:18:46] I think I filed a but for that [13:18:50] bug [13:18:53] let me find it [13:18:54] what brings them in? [13:19:17] https://phabricator.wikimedia.org/T127054 [13:19:33] oops typo: modprobe.blacklist=ipmi_si [13:20:18] this can probably be fixed in the late_install script in d-i, but haven't tested that yet [13:21:43] relatedly, authdns and lvs boxes aren't up to date on packages either, and authdns is still back on 3.19 too [13:22:34] several hosts with reboot problems in codfw have been fixed by Papaul with an idrac update [13:22:44] I ran into those with the last kernel updates [13:22:53] is it pinentry-gtk2 bringing in the gtk dependencies? [13:23:51] https://phabricator.wikimedia.org/T130008 [13:23:59] ema: yes [13:24:32] I'm not sure which exact idrac version Papaul installed, though [13:29:54] ema: want to try a test host and make sure about initramfs-tools? maybe a text or upload node in eqiad? [13:30:08] (with all packages upgraded + 4.4 install + reboot) [13:30:37] sounds good!
[13:31:00] 10Traffic, 10MediaWiki-Interface, 6Operations: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2183332 (10BBlack) 5Open>3Resolved I'm guessing by now they're all naturally expiring out anyways since there's no further feedback. [13:31:58] let's say cp1052 (text) [13:32:26] moritzm: apt-get upgrade first and then apt-get install linux-image-4.4.0-1-amd64? [13:32:50] this way we can see how well the new initramfs-tools and our kernel play together [13:33:12] 10Traffic, 6Operations, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183337 (10BBlack) p:5Low>3Normal We didn't end up keeping SPDY disabled, and HTTP/2 is coming. From our end, this is a relatively simple change now, but t... [13:33:39] +1 [13:35:57] ema: either that or simply install linux-meta (which will pull in the latest package) [13:35:57] 10Traffic, 6Operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183342 (10BBlack) We actually dug further into related issues when investigating WDQS woes on cache_misc, and the problem is different than what we thought we understood i... 
[13:36:20] 10Traffic, 6Operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183343 (10BBlack) [13:36:22] 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2183344 (10BBlack) [13:36:31] cool [13:36:49] alright, depooling cp1052 then [13:38:59] 10Traffic, 6Operations, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183349 (10BBlack) [13:39:02] 10Traffic, 6Operations, 6Performance-Team, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2183350 (10BBlack) [13:39:18] moritzm: is there a phab tracking kernel upgrades? I could only find https://phabricator.wikimedia.org/T126320 [13:41:01] 7HTTPS, 10Traffic, 10Citoid, 6Operations, and 3 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#2183351 (10BBlack) 5Open>3Resolved a:3BBlack When we moved various *oid to the text cluster as part of the parsoidcache decom, they got forced to HTTPS-o... [13:42:05] 10Traffic, 6Operations: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#2183359 (10BBlack) 5Open>3declined We've made decisions about this in the past already and moved past this idea. The general direction is to always use LVS for multi-host varnish backends, and solve HTTPS iss... [13:42:23] ema: not yet, let me create one [13:42:30] thanks [13:43:33] https://phabricator.wikimedia.org/T131928 [13:44:47] I'm going to try upgrading baham and see how that flies (authdns latest packages + 4.4), once cp1052 comes back ok [13:47:38] apt-get upgrade done on cp1052 [13:47:55] linux-meta recommends firmware-linux-free, do we need it? [13:48:18] moritzm, bblack: ^ [13:48:35] not sure. 
I haven't installed any recommended-only packages in the past [13:48:47] let's see what happened on LVS [13:49:12] it's not installed on cp1071 (running 4.4) so I assume we don't need it [13:49:36] root@lvs1010:~# dpkg-query -s firmware-linux-free [13:49:37] dpkg-query: package 'firmware-linux-free' is not installed and no information is available [13:49:40] yeah same on lvs1010 [13:50:09] alright! 4.4 installed, ready to reboot [13:52:23] reboot in progress [13:52:52] so when we get around to the part about rebooting every cache... [13:53:27] what I've done in the past is to generate a hostname list, and then randomly shuffle the file around so we're a little less likely to hit the same cluster+dc for multiple hosts in a row [13:55:06] and then do a while loop over that list that goes through them one at a time on neodymium, using confctl to depool it, salt command to downtime the host in icinga for 15 minutes, then salt it a reboot command (using "at" to defer it, so the salt command returns immediately and correctly), then sleep before doing the next [13:55:10] cp1052 is back up [13:55:17] and set the sleep time to give enough spacing for recovery and so-on [13:55:33] where that falls apart is when more than just one or two nodes fail to reboot cleanly on their own, as might be the case for esams still... [13:57:53] uh nginx didn't start properly on cp1052 [13:58:11] did puppet finish running before you checked? [13:58:27] yup [13:58:34] Apr 06 13:55:48 cp1052 nginx[3492]: nginx: [emerg] chown("/var/lib/nginx/body", 33) failed (1: Operation not permitted) [13:58:44] yeah [13:58:49] that's a race, nice [13:59:23] hmmmmm [13:59:33] it has to do with our /var/lib/nginx mount [13:59:52] and the new systemd sec unit file [14:00:07] let me poke at it a bit [14:00:12] go ahead [14:01:41] ok fixed temporarily there [14:01:50] we need CAP_CHOWN in our sysd sec file for nginx [14:02:15] it wasn't apparent in earlier testing because we only restarted the service.
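The depool/downtime/deferred-reboot cycle described at 13:53–13:55 above could be sketched roughly like this, with echo standing in for the real confctl/salt invocations (whose exact arguments aren't in the log) and example hostnames from earlier in the conversation:

```shell
#!/bin/sh
# Dry-run sketch of the rolling reboot loop: shuffle the host list so
# consecutive reboots rarely hit the same cluster+DC, then depool, downtime,
# and defer the reboot via 'at' so the salt command returns promptly.
# Every command is an echoed placeholder, not the real tooling syntax.
hosts='cp1052.eqiad.wmnet
cp3048.esams.wmnet
cp4006.ulsfo.wmnet'

shuffled=$(echo "$hosts" | shuf)

for h in $shuffled; do
    echo "confctl depool $h"                              # remove from LVS pools
    echo "icinga-downtime $h 15m"                         # placeholder for the icinga downtime salt command
    echo "salt $h cmd.run 'echo /sbin/reboot | at now'"   # 'at' defers, so salt returns cleanly
    # sleep here long enough for the host to come back and recover before the next one
done
```

As noted above, this falls apart when more than one or two nodes fail to reboot cleanly on their own.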
rebooting the host wipes out the contents of the /var/lib/nginx tmpfs mount, resetting already-set perms [14:02:29] uploading a patch... [14:02:49] nice [14:07:21] I'm salting puppet on all the caches now to pick up the nginx+sysd fixup [14:08:20] also, about to try baham [14:08:28] varnishkafka-webrequest also had something to complain about [14:08:41] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183411 (10RobH) >>! In T131880#2182880, @Gehel wrote: > @RobH I'd really appreciate if you could let me do the reclaim / reinstall so that... [14:08:42] Apr 06 13:56:27 cp1052 varnishkafka[1346]: KAFKAERR: Kafka error (-195): kafka1014.eqiad.wmnet:9092/14: Failed to connect to broker at [kafka1014.eqiad.wmnet]:9092: Network is unreachable [14:09:13] however it was running properly from systemd's point of view [14:09:23] well was that a temporary error? [14:09:51] not sure, the last log was an error. I'd expect a reassuring message afterwards in case of temporary issues? [14:11:20] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183429 (10BBlack) There's no real need to reinstall them. I have patches pending to put them into their proper roles, etc. [14:12:13] I wouldn't expect anything sane out of vk [14:12:21] does it show a good conn now in lsof? [14:12:42] if that was a fatal error with no retry, you'd think the daemon would just exit [14:13:13] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183432 (10BBlack) The patch series starts at: https://gerrit.wikimedia.org/r/#/c/268236/ , but needs manual rebases at this point. [14:13:21] 10Traffic, 10DNS: Set SPF (...
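The patch being uploaded presumably adds CAP_CHOWN to nginx's allowed capabilities. As a hypothetical sketch (not the real unit file: the path is redirected to a local overlay/ dir so it's runnable anywhere, and the other capabilities listed are guesses, not the actual set):

```shell
#!/bin/sh
# Hypothetical systemd drop-in granting nginx CAP_CHOWN, so it can re-chown
# /var/lib/nginx after the tmpfs mount comes up empty on reboot. On a real
# host this would live under /etc/systemd/system/nginx.service.d/.
set -e
mkdir -p overlay/nginx.service.d
cat > overlay/nginx.service.d/capabilities.conf <<'EOF'
[Service]
# CAP_CHOWN added so the startup chown("/var/lib/nginx/body", ...) succeeds
CapabilityBoundingSet=CAP_CHOWN CAP_SETUID CAP_SETGID CAP_NET_BIND_SERVICE
EOF
# then, on a real host: systemctl daemon-reload && systemctl restart nginx
```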
-all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2183433 (10Nemo_bis) [14:13:48] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183446 (10BBlack) (it's better to look at T109162, that had all the patch links) [14:13:57] yeah connections look good [14:15:28] ok starting on baham... [14:15:38] repooling cp1052 then [14:19:47] heh, ferm on authdns (and probably others) has bad startup dependency loops [14:22:41] well, there are other issues too, I don't think any of it's from the new packages/kernel, we've seen it before [14:23:32] did you run into https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796611 ? [14:23:48] I'm considering to backport the systemd unit to our jessie package [14:24:54] I don't know... [14:25:06] no, I don't think it's that, but not 100% sure [14:25:47] the apparent root-most problem I see on baham (and this was true last time I rebooted any of our authdns boxes IIRC) is that on startup, it thinks network is up and starts all kinds of services, but the network isn't working [14:26:10] I can ping some nearby hosts, e.g. from baham (208.80.153.13), I can ping -n on 153.14 [14:26:20] but the ipv4 default route is to 153.1, and I can't ping that [14:26:36] all the other services start anyways, can't resolve hostnames because they can't reach our DNS recursors, etc... 
[14:26:56] eventually several minutes later, pings to the def gw start working, and now I have to go restart a bunch of borked services [14:27:25] ferm in particular is a victim because it needs to do DNS lookups to resolve hostnames for rules [14:27:47] but I don't know if that's also separately a chicken and egg thing, if we're trying to start ferm before the interfaces are online or something [14:27:52] (or somehow indirectly related) [14:28:08] right, I also ran into this on radon with an earlier reboot [14:28:22] it just magically starts to work after a minute of digging :-) [14:39:40] the fact that no iptables rules were defined on baham depends on the fact that ferm didn't do its thing on boot I guess [14:40:09] yeah [14:40:11] I started ferm now [14:40:22] ferm couldn't resolve hostnames it needs for rules [14:40:33] so things started "working" for sure around :25, maybe a little earlier [14:40:36] syslog around there is: [14:40:38] Apr 6 14:24:37 baham systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1785 (check_disk) [14:40:41] Apr 6 14:24:37 baham systemd[1]: Mounting Arbitrary Executable File Formats File System... [14:40:44] Apr 6 14:24:37 baham systemd[1]: Mounted Arbitrary Executable File Formats File System. [14:40:47] Apr 6 14:24:37 baham rsyslogd0: action 'action 1' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/0 ] [14:40:50] Apr 6 14:24:37 baham rsyslogd-2359: action 'action 1' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/2359 ] [14:40:53] Apr 6 14:25:01 baham CRON[1806]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) [14:40:56] Apr 6 14:25:01 baham ntpd_intres[1441]: DNS chromium.wikimedia.org -> 2620:0:861:2:208:80:154:157 [14:40:59] so at 25:01, DNS is working again because routing is working again [14:41:09] does that automount for binfmt_misc have something to do with all of this indirectly? 
[14:41:47] the 25:01 timestamp could be after some retry-delay for ntpd_intres [14:41:56] I wonder if rsyslogd triggered that mount, or responded to it? [14:44:19] bblack: uh the varnishkafka-webrequest problem on cp1052 is likely related to what we're seeing on baham [14:44:41] as in: systemd thinks the network is up, starts varnishkafka, [kafka1022.eqiad.wmnet]:9092: Network is unreach [14:44:47] *able [14:45:22] yeah [14:45:31] in that case however the automount for binfmt_misc happens at 13:56:07 [14:45:44] and at 13:56:27 the network is still unreachable [14:45:47] it just doesn't take so long to fix itself on caches for whatever reason [14:46:38] lldpd config is off on baham too, and it complains on startup. surely not related? [14:47:22] 10Traffic, 6Labs, 6Operations, 10Tool-Labs, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10zhuyifei1999) [14:50:52] lldpd config seems to be borked lots of places, hmmmm [14:51:02] anyways, I have an interview to do on the hour [14:51:44] restarted ntp on baham, that was the only thing red in icinga, should recover itself eventually now [14:51:52] back in an hour or so [14:52:02] see you later [14:53:16] 10Traffic, 6Labs, 6Operations, 10Tool-Labs, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10valhallasw) Does Wikipedia Zero include non-wikipedia domains? I would expect tools.wmflabs.org to fall out of scope.
[14:53:37] before we start on the reboots, might ping papaul about whether we can query iDRAC rev and/or fix it on the way [14:53:40] for the esams issues [14:53:57] but IMHO we can go ahead with the package updates + kernel installs on them all for now and sort out the reboots after [14:54:42] alright then I'll go ahead with the updates on all cp* hosts [15:00:59] 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183560 (10RobH) @gehel: Since this isn't going to end up being a reinstall, I'll ping you to do a reinstall on one of the many I do every w... [15:14:49] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183605 (10mmodell) I'm sure we could hack the Jenkins job to use https but the staging... [15:20:48] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183636 (10mmodell) Why is labs blocked from connecting to ssh? Is that to avoid people... [15:21:32] all cp* hosts upgraded to jessie 8.4 and kernel 4.4 [15:22:21] yay [15:22:37] I'll probably wait a couple days on the other 2x authdns, since this is the first authdns+4.4 on baham [15:23:29] (plus, I'd like to have better plans or ideas before the next authdns reboot about wtf is going on with network...) [15:27:00] 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183653 (10RobH) a:5RobH>3None Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannto seem to...
[15:29:02] right [15:29:07] bblack: on cp1008 I've installed the upgrades with apt-get install instead of upgrade to avoid downgrading nginx [15:29:18] ok, awesome [15:57:39] oh look, another meeting :P [15:57:54] lol [15:58:57] so yeah on the misc-web stuff, I guess there's a bunch of horrible repetitive varnish4 conditionals if we leave the VCL stuff as-is? [15:59:05] correct [16:00:02] ok I'll poke around a bit later today, see if there's an "easy" way to get halfway to https://phabricator.wikimedia.org/T110717 from where we are now [16:00:48] maybe start adding new keys to the existing $app_directors for misc or something, which can define the incoming req.http.host matches and pass/TLS behavior, et.c.. [16:01:00] nice. Strictly for misc, a data structure with host_eq or path_re mapping to a backend name would be enough [16:01:16] but that's certainly not as ambitious as T110717 :) [16:02:19] in the general case the stuff in $app_directors is a good candidate for hiera, but I think I avoided that for now because there's still so much refactor going on... [16:03:35] perhaps we could start by adding host_eq|path_re to app_directors for now and then think of a good way to refactor that into yaml? [16:05:10] I think the main holdup on hieradata yaml even for the existing app_directors is the be_opts stuff [16:05:20] 'be_opts' => merge($app_def_be_opts, { 'port' => 8080 }), [16:05:21] and such [16:06:06] obviously, we can 'fix' that by copying the defaults to every stanza... [16:06:27] not that pretty though [16:07:11] before we had the layers of defaulting, and the "merge" happened down in the VCL templates, I just undid that in some earlier refactors which left us with the merge() up in puppet [16:33:06] DegradedArray event on /dev/md/0:cp1052 [16:33:08] known? [16:33:16] it was recently-rebooted [16:33:31] probably the systemd thing again... 
[16:33:39] no icinga alert though [16:34:08] I think we've seen this before, with a race between starting md arrays and the detection of sdX disks coming online [16:35:45] don't ask me what the real real cause is [16:35:54] yup [16:36:05] Apr 06 13:54:45 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors [16:36:08] Apr 06 13:54:45 cp1052 kernel: md0: detected capacity change from 0 to 9990832128 [16:36:11] Apr 06 13:54:45 cp1052 kernel: ata2.00: Enabling discard_zeroes_data [16:36:14] Apr 06 13:54:45 cp1052 kernel: sd 1:0:0:0: [sdb] Attached SCSI removable disk [16:36:22] yeah [16:36:34] in syslog it has precise timestamps too, the race is tiny [16:37:24] I wonder what's up with that, since it seems to be at the kernel level before systemd gets involved [16:37:34] a flag in the md config on disk to wait for all disks or something? [16:40:37] anyways, fixed for now with: mdadm --manage /dev/md0 --add /dev/sdb1 [16:40:43] and now it's rebuilding the mirror [16:40:51] alright [16:41:50] I'm sure it's ultimately systemd related somehow. or at least udev related which is systemd related [16:42:07] it smells of systemd.
always racing to start everything as fast as it can in parallel, nevermind all the b0rkage [16:44:37] systemd needs a config flag for "I don't give a fuck about booting 30 seconds slower, please just boot everything correctly" :P [16:44:47] :) [16:44:53] but that's the kernel right [16:45:15] well it's the kernel running stuff from initramfs for udev rules and udev and systemd are in bed with each other [16:45:18] I consider it all related [16:45:59] oh in that sense [16:47:49] 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, and 2 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2184019 (10Jdlrobson) [16:50:17] I remember having these issues before systemd and rootdelay=X was the workaround [16:51:43] as in not exactly like this but weird race conditions while booting [16:59:33] hmmm [17:00:13] in the syslog it seems like the basic order of events is "sda detected -> part of md0 -> start md0 (without sdb, because it's not here yet) -> sdb detected" [17:00:31] yep [17:00:32] so it seems like there should be some kind of udev+md solution for "wait a bit for all your disks to show up"...
[17:01:22] I mean at first glance you'd say "have md not start until all disks show up", but then if a disk really is missing you don't even get a degraded start [17:01:32] so something, something has to have a timeout where it gives up and just uses the disks it has [17:03:08] in the general case there are probably edge cases like md devices which have one or more disks on pluggable storage the user might not plug in for hours [17:09:19] one thread on the internet suggests renaming them in mdadm.conf from /dev/md/0 to /dev/md0 : https://bugs.archlinux.org/task/33851#comment106076 [17:09:36] something about some initramfs thing that only knows about the /dev/mdX style names [17:09:59] and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=633024 [17:11:06] yeah I was reading the debian bug as well [17:11:15] As the documentation explains, rootdelay delays between SCSI scan and [17:11:17] mdadm / LVM assembly / scan [17:11:38] yeah [17:11:44] I mean, it's a viable workaround, I just hate it [17:11:53] sticking sleeps in places instead of actually fixing the real dependency issue [17:11:59] not cool [17:15:13] one of the installserver changes we have from trusty to jessie is: [17:15:14] -d-i debian-installer/add-kernel-opts string elevator=deadline rootdelay=90 [17:19:33] there's a /lib/systemd/system/mdadm-last-resort@.timer [17:19:41] Timer to wait for more drives before activating degraded array. [17:19:57] hmmm [17:20:09] as well as /lib/systemd/system/mdadm-last-resort [17:20:16] sorry, /lib/systemd/system/mdadm-last-resort@.service [17:20:25] Activate md array even though degraded [17:20:51] but surely this isn't for the rootfs [17:22:42] bblack: have you seen a race condition like this with 3.x kernels?
[17:24:23] I think so, but I'm not 100% sure [17:31:56] FWIW, for the recent mass reboots of 3.13 and 3.19 I only ran into this with authdns/radon [17:32:50] another workaround, slightly better than rootdelay, could be scsi_mod.scan=sync [17:52:20] 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184248 (10faidon) The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to t... [18:01:11] today is one of those days where I have too many open windows / patches / investigations. I'm just getting bogged down in the multitask churn. [18:02:30] we should chase down all these bootup errors before we reboot all the things for 4.4 though [18:02:38] save a lot of pain this time around and down the road [18:03:19] yeah [18:03:21] I'd start with seeing if scsi_mod.scan=sync and/or rootdelay=10 or something fixes this one. if it works as a one-off, find the right way to puppetize it and play nice with grub customization for debian [18:03:38] depool a server and reboot it like 10 times and see if the fix sticks? [18:03:46] perhaps cp1008? [18:04:08] if we manage to reproduce the issue there, of course [18:04:55] I'll create a phab task for both boot issues in the meantime [18:05:29] ok thanks [18:16:39] 10netops, 10Analytics-Cluster, 6Operations, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184309 (10RobH) Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.) The system was also showing in the config in the def... 
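The "depool a server and reboot it like 10 times" validation proposed above could look something like this dry-run sketch (host, kernel args, and the mdstat check are all illustrative, not an agreed procedure):

```shell
#!/bin/sh
# Dry-run sketch of the repeated-reboot test for the md assembly race:
# reboot a depooled host N times with the candidate cmdline workaround
# and count how often md0 assembles with both members. Every command is
# echoed rather than executed.
host=cp1008.wikimedia.org
n=10
ok=0
i=1
while [ "$i" -le "$n" ]; do
    echo "ssh $host reboot   # cmdline: scsi_mod.scan=sync (or rootdelay=10)"
    # once it's back up, a check along these lines would count a clean assembly:
    # ssh $host grep -q '\[2/2\]' /proc/mdstat && ok=$((ok + 1))
    ok=$((ok + 1))   # stand-in: assume success in this dry run
    i=$((i + 1))
done
echo "$ok/$n boots with md0 fully assembled"
```

If the workaround sticks, the remaining work would be puppetizing it cleanly alongside debian's grub config, as discussed above.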
[18:20:38] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Andrew) > Why is labs intentionally blocked from connecting to ssh? Can you... [18:29:45] mmh [18:29:47] Apr 06 13:54:46.849944 cp1052 lldpcli[919]: unknown command from argument 1: `#` [18:30:19] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2184350 (10BBlack) So, the gerrit change is held up on comments about `mx ?all` vs `mx -all`. Are we confident phab emails only come from our mxes? ping @chasemp and... [18:30:32] ema: yeah I saw that earlier [18:30:49] but apparently lldpd isn't doing much pretty much everywhere. I thought it was at one point in the past... [18:30:57] something's gone off the rails with our puppetization/config of it [18:31:28] I was wondering if that might somehow be indirectly related to our "can't ping the default gateway" thing... [18:33:24] bblack: has that happened in the past or is the 4.4/point release upgrade of today the likely cause? [18:33:36] it's happened before 4.4 [18:33:44] ah there you go [18:33:46] on the caches rebooting 3.19 -> 3.19 [18:34:02] (the general can't ping default gateway thing, I mean) [18:34:09] sorry s/caches/authdns servers/ [18:34:29] on the authdns servers that bad network state persists for minutes anyways. maybe it happens elsewhere for shorter windows of time. [18:34:50] OK, because interestingly enough on cp1052 we also saw "network issues" with varnishkafka [18:35:07] but on the other hand ipsec seems to have worked fine [18:35:22] yeah [18:35:37] could be separate, something to do with ipv6 for kafka maybe, I saw something funky relatedly once [18:36:07] in other news, the new lvs salt grains work.
you can do things like: salt --out=raw -v -t 5 -b 1000 -G lvs:primary cmd.run '....' [18:36:14] ditto -G lvs:secondary [18:36:26] since for so many things, we do them on the secondaries then primaries or whatever to be careful [18:36:39] I got tired of having to list out the hostnames among the set of -G cluster:lvs [18:38:43] oh nice [19:35:25] 7HTTPS, 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#1830027 (10BBlack) Note there are probably question-marks around these about insecure requests. We don't yet block/deny insecure POST traffic ( T105794 ), but we'... [19:44:55] 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184631 (10BBlack) So, we've had the API warning up for a couple of months now. In general, we've continually fallen behind on promises to notify -> kill insecure... [19:51:24] 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184639 (10konklone) @BBlack If you want someone to remind you about it, I am happy to volunteer. ;) [20:41:41] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2184719 (10chasemp) 22 to only 208.80.154.250/32 as the service address for git-ssh shou...