[00:03:15] * bd808 gets a crash course in WMF LVS via the magic of backscroll
[00:21:05] !log added #wikimedia-traffic channel to stashbot config, test
[00:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:19] bblack: ^ done :)
[00:21:25] you can log here now
[00:22:44] how: ssh root@tools-login.wmflabs.org ; become stashbot ; vi ./etc/config-j8s.yaml ; ./bin/stashbot.sh stop/start
[00:22:47] mutante: you figured out my wacky config file scheme? :)
[00:22:53] seems like it :)
[00:23:17] i copied #wikimedia-fundraising which is "use_config: '#wikimedia-operations'"
[00:23:26] so just like -ops channel
[00:23:48] bd808: i assume the config file itself isn't in git, right?
[00:23:48] oh yeah. I forgot about that magic
[00:24:30] yeah, it's not. I have some task somewhere to separate secrets from plain config so most of it could be versioned
[00:24:43] ok, yep
[00:25:01] probably not going to happen any time soon though
[00:26:24] imagines one day we have an actual prod machine for bots
[00:27:01] (for the ones we actually rely on) and then we puppetize it with passwords in private repo
[00:27:42] yeah. stashbot actually runs as a k8s pod so it could go in the prod k8s cluster once it exists
[00:27:51] *nod*
[00:28:03] it sends data to tool labs boxes though which would be difficult to do from prod
[00:28:24] for https://tools.wmflabs.org/sal/
[00:28:45] that could be in prod too though. it's a trivial php app
[07:09:57] mutante: thanks, that's nice!
[07:31:37] Traffic, Operations, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3159658 (ema) We're currently running with a [[ https://gerrit.wikim...
[08:23:15] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3159811 (Beetlebeard) Resolved>Open
[08:23:33] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (Beetlebeard) Dear Dzahn I am so sorry but we would like to go back to the old system. Everyone using @wikimedia.ee addresses had to...
[08:31:21] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (MoritzMuehlenhoff) @Beetlebeard If cost is the primary issue; G Suite is free of charge for non-profits: https://www.google.com/non...
[08:41:45] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3159822 (Beetlebeard) Thanks Moritz. The others using @wikimedia.ee addresses still prefer to change back. They just liked the old system...
[08:45:40] fascinating, I'm not really sure now that the swift issue is due to swift-object, nor that it depends on swift's version
[08:46:03] I'm trying some IMS requests on ms-be1027, running 2.2.0
[08:46:18] and I get 304 responses with the proper CT
[08:46:34] yeah, lol, 2.2.0
[08:46:39] ignore me
[08:48:01] rotfl
[09:33:30] Traffic, Operations, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159891 (ema)
[09:40:12] Traffic, Operations, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159922 (ema) p:Triage>High
[09:40:50] Traffic, Operations, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159891 (ema)
[09:50:15] Traffic, Operations, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159942 (ema)
[09:59:24] ema: I've added a comment to the paste patch, not sure if Phab notifies for those things ;)
[09:59:29] just FYI
[10:01:31] volans: thanks, yes phab does notify also on paste comments apparently!
[10:01:45] good to know
[11:56:35] moritzm: not going too well with the cache upgrades to 4.9! cp2006 seems to be stuck on:
[11:56:38] A start job is running for LSB: Raise network interf...11s / no limit)
[11:57:24] oh no, now it made it
[11:57:35] that took a pretty long while!
[11:57:58] let me look at the logs
[11:58:41] FTR, I was keeping a ping open to see when/if the host would come back online
[11:58:59] it's been offline between 11:52:35 and 11:57:18
[11:59:21] see "systemd-analyze blame", it took three minutes to bring up the network
[11:59:32] fun!
[12:00:17] e.g. on cp4011 this took 1.062 seconds
[12:00:48] systemd-analyze blame prints a list of all running units, ordered by the time they took to initialize.
[12:00:51] wow!
[12:01:17] didn't know the command, really useful
[12:01:19] Apr 06 11:54:16 cp2006 systemd[1]: Starting LSB: Raise network interfaces....
[12:01:22] Apr 06 11:57:17 cp2006 networking[831]: Configuring network interfaces...net.ipv4.conf.all.arp_ignore = 1
[12:01:27] it's pretty cool, it even has an option to make a DOT file for Graphviz!
[12:03:04] there's a hung process in syslog
[12:03:15] systemd-udevd?
[12:04:11] probably during device discovery or so?
[12:07:10] loading a module it seems: load_module -> ... -> intel_uncore_init+0x1c1/0xeaa [intel_uncore]
[12:07:31] https://groups.google.com/forum/#!topic/fa.linux.kernel/I58lsXR9DOU :
[12:07:45] "While working on the hotplug rewrite I stumbled over the uncore drivers. The intel_uncore driver particular is a complete trainwreck"
[12:07:51] yay!
[12:10:38] which is good given that it's responsible for mundane stuff like L3 cache, QPI, on-die memory controller...
[12:11:08] it's a loadable module, we could rmmod intel_uncore
[12:12:09] moritzm: what does the module do?
[12:12:11] I wonder what that driver even does
[12:12:13] heh
[12:12:18] hehe
[12:12:43] according to Kconfig: "Include support for Intel uncore performance events. These are available on NehalemEX and more modern processors"
[12:12:52] oh so for perf?
[12:12:55] seems limited to performance monitoring
[12:13:19] yeah, the Kconfig variable is called PERF_EVENTS_INTEL_UNCORE
[12:13:42] I'd say let's see whether this also occurs on a server with identical hardware
[12:13:46] OK yeah I guess we could load it when needed rather than by default
[12:13:48] and if so, blacklist the module
[12:13:52] apparently it helps with device/power management for potential other uncore bits like embedded DRAM or i915 graphics too
[12:14:22] (when those are embedded in the CPU die)
[12:16:21] cp2005 has no uncore loaded
[12:16:36] is it (or the loading of it) new since 4.4?
[12:17:19] I just checked, seems also present in 4.4, but having a closer look
[12:18:04] looks like it's loaded only on the cache hosts already upgraded to 4.9
[12:19:33] (btw the cp2xxx hosts, unlike most sites, were purchased in a single batch and all hardware-identical)
[12:19:49] nice, I'll try another cp2* host then
[12:20:36] it's built-in in 4.4, which didn't allow building it as a module
[12:20:44] but it was already enabled
[12:22:21] alright, so perhaps part of the trainwreck is the interaction with udev
[12:23:27] cp2009 rebooting, let's see
[12:25:45] heh, I think it crashed
[12:25:59] there's some random gibberish in console and then
[12:26:06] [ 102.102386] Call Trace:
[12:26:06] [ 102.105119] [] ? vprintk_emit+0x31c/0x4f0
[12:26:06] [ 102.111437] [] ? rcu_note_context_switch+0xb8/0xc0
[12:26:09] [ 102.118622] [] ? _raw_spin_lock+0x1d/0x20
[12:26:12] [ 102.124938] [] ? __schedule+0x94/0x6d0
[12:26:14] [ 102.130965] [] ? schedule+0x32/0x80
[12:26:17] [ 102.136692] [] ? do_exit+0x970/0xb50
[12:26:19] [ 102.142526] [] ? rewind_stack_do_exit+0x17/0x20
[12:26:49] and a little above:
[12:26:50] [ 102.045578] INFO: rcu_sched detected stalls on CPUs/tasks:
[12:26:50] [ 102.051716] 0-...: (2 GPs behind) idle=be9/140000000000000/0 softirq=200/200 fqs=10162
[12:27:06] will turn it off and on again
[12:30:10] ok it came up fine
[12:30:36] (except for the crash above, probably still when shutting down for the reboot?)
[12:31:24] 1.141s networking.service
[12:31:33] Startup finished in 13.492s (kernel) + 52.624s (userspace) = 1min 6.117s
[12:33:37] to rule out an actual hardware defect we could test whether stopping/starting the network on 2006 also requires three minutes?
[12:34:16] BTW, intel_uncore seems in use on 2006 but not on 2009
[12:37:03] I'm having a look at other hosts with Linux 4.9
[12:37:27] moritzm: how about rebooting cp2006 again? HW issues might be more likely to show up on device init as we've seen with the other unlucky hosts
[12:37:55] ema: makes sense
[12:38:14] we have 49 other hosts with intel_uncore, which don't seem to have exposed any problems so far
[12:39:00] well
[12:39:20] the bulk (I think is still true?) of our hardware is Dell-based. These cp2xxx are HP
[12:39:24] I think?
[12:39:30] wait maybe we stuck with dell there too
[12:39:45] maybe it was just the lvs in codfw that went HP
[12:40:17] according to racktables it's dell
[12:40:20] (2006)
[12:40:24] yeah dmesg too, ignore me above :)
[12:40:31] Dell PowerEdge R630
[12:40:54] but this might just as well be CPU specific
[12:41:01] or microcode dependent
[12:41:24] * moritzm really needs to get to https://phabricator.wikimedia.org/T127825
[12:41:32] ema: have you done any esams or eqiad 4.9's yet?
[12:42:11] the one in esams exposed a hardware problem
[12:42:27] in terms of "generation of hardware orders from dell", these cp2 are from the same approximate timeframe as all of esams cp30[34][0-9] and also just cp107[1-4] in eqiad
[12:42:29] https://phabricator.wikimedia.org/T162132
[12:43:15] right, I've done some ulsfo and codfw so far, the only attempt in esams failed (T162132)
[12:43:15] T162132: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132
[12:43:38] (so cp2 and the others mentioned above might come out similar when it comes to cpu model specifics, various microcode or firmware revs, etc)
[12:43:47] cp3003 is much older/different
[12:43:49] OK cp2006 rebooted without issues
[12:49:16] bblack: so perhaps it would be good now to try upgrading cp3007, which might be comparable to cp3003?
[12:49:42] sure
[12:49:54] but in misc rather than maps so that we don't lose too many soldiers there in case of trouble :)
[12:53:39] cp3007 is a PowerEdge R620
[12:55:36] all good w/ cp3007
[13:48:10] netops, Operations, ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3160332 (ayounsi) From Juniper at about 9am UTC: >Sent by Carrier: UPS >Tracking Number: 1Z223V170461615001 >Tracking URL: http://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=1Z223V170461615001...
[13:52:35] bblack: so I gave a quick shot at fixing cp1008's VCL compilation failures with https://gerrit.wikimedia.org/r/#/c/346733/ but that didn't work :)
[13:52:58] we're not using dynamic_directors there so the cache_{eqiad,codfw} directors aren't defined
[13:53:37] how does that work in labs? Perhaps some conditionals on realm or so?
[13:57:13] yeah I think I broke it with the active/active patch, and probably yeah a realm conditional
[13:57:43] https://gerrit.wikimedia.org/r/#/c/339667/
[13:58:52] probably L142 here: https://gerrit.wikimedia.org/r/#/c/339667/12/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb
[13:59:00] but I haven't looked at the cp1008 fail output
[13:59:25] mmh yeah, L142 for sure but there's more
[13:59:34] L16 too
[13:59:59] ok
[14:00:10] yeah in general, the patch tends to assume multi-tier stuff
[14:00:10] fail output here https://phabricator.wikimedia.org/P5206
[14:00:28] I'll look at it later if you want, I have an interview thing
[14:01:24] bblack: I'm happy to play with it, do you think it'd make sense to avoid the cache_#{@cache_route} stuff if !dynamic_backends?
[14:03:55] I'll stare at it a bit longer :)
[14:04:53] honestly I've been trying to get rid of that dynamic backends conditional, I think I killed part of it earlier, but not completely yet
[14:05:12] basically labs/cp1008 are the only case. all real prod cache nodes always have dynamic backends for inter-cache, and don't for applayer
[14:05:41] yeah
[14:07:27] Traffic, Discovery, Maps, Operations, Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160389 (Gehel)
[14:11:39] Traffic, Discovery, Maps, Operations, Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160407 (Gehel) Looking at Tasmania on the maps / codfw cluster, it looks like we did not regenerate all tiles after the T159631 incident. This is now in progress....
[15:27:01] netops, Operations, ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3160606 (RobH) Done, I've also asked support about the followup part swap, and how we can arrange it.
[16:51:52] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3160894 (dr0ptp4kt) @Dzahn, no luck for me. So when you're at "https://www.google.com/webmasters/tools/home?hl=en" does it say "No new messages or recent criti...
[17:20:37] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161007 (Dzahn) @dr0ptp4kt So you are saying even though i gave you full access to the domain(s) you can't read the associated messages? That seems strange, th...
[17:22:21] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161015 (Dzahn) >>! In T161343#3160894, @dr0ptp4kt wrote: > @Dzahn, no luck for me. What went wrong? Did it say "permission denied" or something? What was th...
[17:24:23] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161019 (Dzahn) @dr0ptp4kt Is this the one you want to approve? " mediawiki-0794@pages.plusgoogle.com would like to associate his YouTube channel MediaWiki to...
[17:25:21] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161020 (dr0ptp4kt) @dzahn, I can review pageview trends, but the console isn't showing me any approval messages to authorize the "MediaWiki" YouTube channel t...
[17:26:11] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161022 (dr0ptp4kt) Race condition! One moment.
[18:22:24] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161317 (dr0ptp4kt) @Dzahn it's unclear if that's the same request; in fact in the YouTube administrator interface the request had been submitted as https://me...
[18:36:34] Traffic, Operations, MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2216137 (Volker_E) [[ https://github.com/wikimedia/wikimediablog-wordpresscom/commit/292c01ccb221dbadfb91786675d4d3cb5a2f3f...
[18:42:06] Traffic, Operations, ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3161381 (Papaul) Will have a replacement board tomorrow between 10:00am and 1:30 PM Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This emai...
[19:11:19] netops, Operations: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3161486 (ayounsi) a:ayounsi
[19:12:02] ema: cp1008 VCL fixed via https://gerrit.wikimedia.org/r/#/c/346813 + https://gerrit.wikimedia.org/r/#/c/346814
[19:14:56] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3161488 (Dzahn) @Beetlebeard Ok, i have reverted the change. It's back to elkdata as it was before this ticket. Changes should appear within...
[19:14:57] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3161489 (Dzahn) @Beetlebeard Ok, i have reverted the change. It's back to elkdata as it was before this ticket. Changes should appear within...
[19:45:08] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161585 (Dzahn) >>! In T161343#3161317, @dr0ptp4kt wrote: > @Dzahn it's unclear if that's the same request; in fact in the YouTube administrator interface the...
[20:34:43] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161691 (dr0ptp4kt) >> One other approach we might try is updating the YouTube account so that noc@ can become a manager of it > > I would try to avoid that....
[21:31:31] Traffic, DNS, Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161892 (Dzahn) following https://www.google.com/webmasters/verification/verification?siteUrl=https%3A%2F%2Fmediawiki.org%2F as noc@ gets me "You are already...
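
Note on the 11:59-12:31 boot-time discussion: the comparison made by eye there (networking taking about three minutes on cp2006 versus 1.141s on cp2009) could be automated by parsing the text that "systemd-analyze blame" prints. The following is only a rough sketch under stated assumptions, not something used in the channel: the names parse_duration and slow_units, the 10-second threshold, and the sample text are all hypothetical, and the parser assumes each line has the form "<duration> <unit name>" with durations built from h, min, s and ms components.

```python
import re

# One duration component, e.g. "3min", "1.062s" or "512ms".
# "min" and "ms" are listed before "s" so the longer unit wins.
_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(h|min|ms|s)")
_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration string such as '3min 1.062s' to seconds."""
    total = 0.0
    for value, unit in _COMPONENT.findall(text):
        total += float(value) * _SECONDS[unit]
    return total


def slow_units(blame_output, threshold=10.0):
    """Return (unit, seconds) pairs at or above threshold, slowest first.

    blame_output is assumed to be the captured text of
    'systemd-analyze blame', one '<duration> <unit>' entry per line.
    """
    found = []
    for line in blame_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # The unit name is the last whitespace-separated token.
        duration, _, unit = line.rpartition(" ")
        seconds = parse_duration(duration)
        if seconds >= threshold:
            found.append((unit, seconds))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Made-up sample for illustration; only the 1.141s figure appears in the log.
SAMPLE = """
  3min 1.2s networking.service
     1.141s example-fast.service
      512ms example-tiny.service
"""

if __name__ == "__main__":
    for unit, seconds in slow_units(SAMPLE):
        print(f"{unit}: {seconds:.3f}s")
```

Feeding it the real output (for example via subprocess on each host) is left out on purpose, since how hosts are reached is site-specific.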