[08:44:12] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) Google gets updates from us more than once a day; I don't know how their updat...
[10:41:45] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp2006.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807...
[11:07:36] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2006.codfw.wmnet'] ``` and were **ALL** successful.
[11:08:55] reimage of cp2006 as stretch smooth and easy!
[11:09:53] ema: pro-tip, if you need to reimage hosts in codfw it's slightly quicker from sarin ;)
[11:11:03] ema: \o/
[11:11:19] ema: is the reimage complete? I need to upgrade cumin on neodymium, and don't want to disrupt you
[11:12:47] seems so, proceeding as it takes few seconds
[11:15:14] volans: it is, yes
[11:15:23] ack, already done, all back yours
[11:27:39] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) There's conflicting information about how Google updates their index. On the one...
[11:28:53] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) This isn't about search engine optimisation in the strictest sense, but...
[11:41:15] vgutierrez: on cp2006 I had to manually upgrade librdkafka1 to 0.11.5-1~bpo9+1 in order for varnishkafka to start properly, I guess we should ensure that happens automatically
[11:41:35] host repooled
[11:42:12] in the sense that it's not installed from backports by puppet?
[11:42:20] volans: correct
[11:42:42] ema: it was rejecting the new config parameters?
[11:42:46] ok
[11:42:49] vgutierrez: also correct :)
[11:43:03] ack
[11:47:20] that's as expected, stretch has a native librdkafka1, so that's picked unless pinning is used
[11:47:35] yey, I was looking for the proper place to do the pinning
[11:50:41] elukey: FYI cp2006 (cache_misc) is now running stretch, varnishkafka seems to be working fine
[11:52:06] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @ArielGlenn @Deskana AFAICT, they're using ChangeProp to update the inf...
[12:00:44] elukey: my definition of vk working fine is "I see cp2006's messages on kafka-jumbo1001 webrequest_misc topic" though, which might not be yours :)
[12:00:54] ema: this should do the trick: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449163/
[12:06:53] vgutierrez: ship it!
[12:44:50] ema: thanks!
[12:45:12] vgutierrez: I am wondering if the pinning would be best in the profile
[12:45:53] elukey: hmmm for me it makes sense to have it as near as possible to the package installation
[12:47:07] and in the profile could look weird...
ie: webrequests having the pinning and eventlogging nope
[12:47:18] effectively would affect eventlogging as well
[12:48:43] ah sorry it is already merged
[12:48:48] all right then, +1 :D
[12:48:51] hahahah
[12:48:52] nevermind :D
[12:49:03] it is not written in stone :)
[12:49:33] half of the traffic team and Moritz are ok with the change => ok for me too :D
[12:49:44] what I just told you are the reasons why I went for that location
[12:49:53] yep yep it makes sense
[12:50:07] thanks!
[12:51:51] so one weird thing is https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=14&fullscreen&orgId=1&from=now-6h&to=now
[12:52:09] if I select cp2006, I see nothing for the past hour and a half
[12:52:35] is it depooled?
[12:54:11] elukey: it is
[12:54:24] elukey: sorry, it is not depooled (it's pooled) :)
[12:56:03] really strange, logster seems working
[12:56:35] even if we have
[12:56:36] /usr/bin/logster -o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=varnishkafka.cp2006.webrequest.misc JsonLogster /var/cache/varnishkafka/webrequest.stats.json > /dev/null 2>&1
[12:56:52] so if it fails it might be difficult to know :D
[12:57:06] stats are in /var/cache/varnishkafka/webrequest.stats.json
[12:57:26] trying to execute logster manually
[12:57:48] logster: error: option -o: invalid choice: 'statsd' (choose from 'graphite', 'ganglia', 'stdout')
[12:57:51] looool
[12:58:32] yeah.. logster-0.0.1 is installed
[12:58:35] instead of 0.0.10
[12:58:36] ah so on cp2005 we have 0.0.10-2~jessie1, on cp2006 0.0.1-2
[12:58:39] exactly
[12:58:56] I assume is kinda the same issue we just solved for varnishkafka with librdkafka1
[12:59:26] 0.0.10-2~jessie1 seems something that we packaged though
[12:59:44] yep
[13:00:00] we should package it for stretch as well
[13:00:41] I'd say that it is just a matter of removing the ~jessie1, build it for stretch-wikimedia, upload and then install
[13:01:04] something like 0.0.10-3
[13:01:31] before you guys ask, I have a plan to add a prometheus agent for varnishkafka
[13:01:34] :D
[13:02:03] I've just copied logster 0.0.10-2 from jessie-wikimedia to stretch-wikimedia, it works :)
[13:03:48] so we leave the ~jessie1 thingy?
[13:06:21] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (10Vgutierrez)
[13:06:56] anyhow, I see datapoints now in grafana :)
[13:07:10] <3
[13:08:55] elukey: on https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster I see 0.0.11-1~trusty1 as the last packaged version
[13:09:57] elukey: just dropping ~jessie1 from the version number seems pointless, but perhaps you wanna release 0.0.11?
[13:13:38] ema: ah ok it might have been a one off build then, nevermind. 0.11 is not worth it imho, I'd prefer to concentrate on prometheus
[13:14:14] k
[13:39:31] !log reboot cp2006 for kernel update
[13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:00] carrying on with more cache upgrades to stretch
[14:21:31] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
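
For reference, a minimal sketch of the kind of apt pinning discussed above for librdkafka1 on stretch: a preferences.d snippet that makes apt prefer the backported build (0.11.5-1~bpo9+1) over the native stretch package. The file name, pin priority and archive name here are assumptions for illustration only; the actual change is the puppet patch at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449163/.

```
# Illustrative sketch only: prefer the backported librdkafka1 over stretch's
# native version so varnishkafka accepts the newer config parameters.
# File name, priority and archive name are assumptions, not the real puppet change.
cat > /etc/apt/preferences.d/librdkafka1.pref <<'EOF'
Package: librdkafka1
Pin: release a=stretch-backports
Pin-Priority: 1001
EOF

apt-get update
# Check that apt now picks the backport as the candidate version:
apt-cache policy librdkafka1
```
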
[14:22:48] ema: I think I'm gonna push some storage changes with stretch too (as in reformat the ext4 differently and use different mount opts)
[14:23:10] stuff found while working on cp1075 that helps with the older SSDs too
[14:23:35] nice
[14:23:46] I'll push it all in a bit, so the next one can have it from the get-go (we might have to depool the ones you've already hit and re-mkfs on them)
[14:24:11] bblack: sounds good, I've only upgraded 2006 so far, 2012 in progress (both misc)
[14:28:20] using stretch reinstalls is a convenient barrier anyways. technically my commit will make jessie re-installs fail temporarily, but IMHO at this point if anything needs reinstalling before conversion is done, we just install it as stretch instead
[14:28:50] definitely, yes
[14:37:15] doh, using the jessie installer considered a bad idea when trying to install stretch
[14:38:04] if you hold on a sec before restarting, we can squeeze in testing this mkfs on a traditional node toio
[14:38:07] *too
[14:38:20] sure
[14:38:29] my primary concern is whether the varnish.main1 -level sizing will still be appropriate, or will need tweaking to match
[14:39:10] (e.g. 360G may have to become 359G, or may be able to now be 362G, I have no idea, there's so much rounding error in related things)
[14:42:42] did you merge mine?
[14:43:07] well anyways it's merged and puppetized to installservers
[14:44:48] in our usual spirit of "turn on 2 seemingly-good things in prod without proper low-level isolated perf testing, because who has time?", the mke2fs/mount changes go to the stretch nodes, as does the scsi blk_mq stuff
[14:44:59] both seem pretty safe and pretty likely to help though :)
[14:45:13] (post-install reboot will net the blk_mq stuff)
[14:45:40] oh, the installers are not puppeted with the new stuff
[14:45:49] I was confused when I lost the "submit" race without checking :P
[14:46:19] bblack: I've puppet-merged the cp2012 pxe patch only
[14:47:23] puppeting install[12]002 now
[14:47:45] ok, will reimage 2012 once you're done then!
[14:47:54] [done]
[14:48:17] if the autoinstall fails, it will likely be because I did something stupid in late_command, we'll see! :)
[14:48:45] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ```
[14:48:58] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
[14:49:11] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ```
[14:49:45] mmh, starting a new reimage failed with "Signed cert on Puppet not found for hosts ['cp2012.codfw.wmnet'] and no_raise=False"
[14:51:19] same host?
[14:51:27] ofc, it was removed by the previous one
[14:51:38] add --no-verify --no-downtime
[14:51:44] volans++
[14:51:59] * volans guess a first reimage failed
[14:52:11] but if it failed in the middle we can restart from where it broke more or less
[14:52:14] depending on the failure
[14:52:17] * volans need more context
[14:52:18] volans: I've stopped it halfway through yeah
[14:52:39] and you want to redo all the d-i and pxe?
[14:52:48] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
[14:53:27] volans: yes, because (1) due to PEBKAC I've started the reimage with jessie instead of stretch and (2) we're trying different mkfs options
[14:53:34] akc
[14:53:35] ack
[14:56:18] d-i started
[14:59:35] bblack: mkfs phase went smoothly, installing base system now
[15:01:06] the mkfs in late_command is after the base system
[15:01:13] (not part of partman stuff)
[15:01:19] ah right!
[15:01:51] the command worked fine on cp1075, the concern is my refactoring to use a shell variable to reduce duplication could've caused some stupid sh error
[15:04:51] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) @Imarlier That's good info, thanks!
[15:09:42] root@cp2012:~# dumpe2fs /dev/sdb3 | grep features
[15:09:42] dumpe2fs 1.43.4 (31-Jan-2017)
[15:09:42] Filesystem features: resize_inode sparse_super2 filetype extent 64bit flex_bg sparse_super large_file huge_file bigalloc metadata_csum
[15:09:49] looks good ^
[15:10:06] nice
[15:10:26] fyi, ulsfo is still depooled for last week's links failures, I'll re-pool after the meeting
[15:10:42] ok
[15:11:23] ema: so the removal of data=writeback from fstab is already puppetized to all stretch caches. it won't fail on the old ones, they'll just not have the mkfs improvements
[15:12:00] I think we already had reasons to need a post-puppetization reboot on new installs, but in any case the scsi_mod param definitely needs one
[15:12:54] so cp2006, should just need to depool, stop varnish.service, unmount /srv/sd[ab]3, run the mkfs commands on sd[ab]3, remount, restart varnish
[15:13:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/449201/1/modules/install_server/files/autoinstall/scripts/late_command.sh
[15:14:05] and reboot for use_blk_mq I guess
[15:14:21] 2006 already picked it up, it was merged earlier than the FS stuff
[15:14:26] ah great
[15:19:08] so at this point cp1075 looks pretty good. it's depooled from cache_text but otherwise live-ish.
[15:19:13] bblack: fs size issues on 2012 it seems
[15:19:18] /dev/sdb3 ext4 364G 48M 356G 1% /srv/sdb3
[15:19:30] -s main2=file,/srv/sdb3/varnish.main2,360G
[15:19:41] ema: did it fail, or just fails to use all the space?
[15:19:55] bblack: varnish-be fails to start
[15:20:08] Jul 30 15:18:32 cp2012 varnish[7199]: Error: (-sfile) allocation error: No space left on device
[15:20:13] ok taking a look
[15:21:31] yeah the size came up slightly-smaller than I expected in the other case too, because of the 16M chunks allocation is done in
[15:23:55] yeah, so 355G works, previous was 360G
[15:24:06] but it seems odd the 16M allocation thing would cause that
[15:24:17] (5G difference for 16M alignment?)
[15:26:33] eh it's not worth messing with perhaps (or can be some later experiment)
[15:26:47] I'll patch to not vary that stuff on stretch, just on new-nvme hosts vs rest
[15:28:57] ok
[15:31:03] I'll clean up 2012 to match, 2006 should be fine as it is
[15:36:17] yeah
[15:38:08] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` and were **ALL** successful.
[15:38:11] bblack: cool, reimage concluded properly now (it was waiting for puppet)
[15:41:07] vgutierrez: FYI on cp2012 librdkafka1's pinning worked but the stretch version wasn't automatically installed, probably some other package depending on librdkafka1 installed it before the pinning took effect
[15:41:32] vgutierrez: I'm upgrading it by hand now and restarting varnishkafka but we should look into this later (or tomorrow!)
[15:42:12] all cleaned up
[15:42:41] so, cp1075 is basically in a seemingly-good state, just depooled from text@eqiad as its 9th member
[15:42:48] nice
[15:43:35] I figure next step on that front is pool it up as #9 and see how it flies, then we can plot out the rest of that hardware set
[15:43:48] but maybe we get past doing some cache_text stretch upgrades on existing nodes first
[15:44:04] yeah, +1
[15:44:31] for now it can sit and burn in, maybe we find a kernel crash while we're waiting or whatever :)
[15:44:52] :)
[15:47:19] I'll install the others in the meantime, at least the base OSes
[15:47:44] I don't remember if adding a bunch of depooled nodes at the pybal or varnish levels perturbs (c)hashing pointlessly at one of the layers
[15:48:42] if chashing does its job, it shouldn't
[15:50:09] !log reboot cp2012 for kernel update
[15:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:48] 10Traffic, 10DNS, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson)
[16:56:32] 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Krenair) Some of them are so similar to each other I believe we can do a huge merge: https://gerrit.wikimedia.org/r/#/c...
[17:59:46] bblack: so far, so good: https://puppetboard.wikimedia.org/fact/default_routes Will wait ~2h then look at pushing the mtu change?
[18:02:57] XioNoX: yeah, the ulsfo one
[18:03:38] yep
[23:09:03] do we have stats on how many requests we get on HTTP? I'm curious about how many people are running either an outdated HSTS preload list, or don't have one at all
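
As a footnote to the cp2006 follow-up earlier in the log: a minimal sketch of the per-host conversion bblack outlines at 15:12:54 (depool, stop varnish, re-mkfs /srv/sd[ab]3, remount, restart, plus a reboot at some point for use_blk_mq). The mke2fs options and the pool/depool wrappers shown here are assumptions for illustration; the authoritative mkfs invocation is the one puppetized in late_command.sh (the gerrit 449201 change linked above).

```
# Illustrative sketch only -- the exact mke2fs flags live in the puppetized
# late_command.sh; the feature list below just mirrors the dumpe2fs output
# pasted from cp2012 (bigalloc, metadata_csum, sparse_super2, ...).
depool                                   # assumes the standard conftool depool wrapper
systemctl stop varnish.service

for dev in /dev/sda3 /dev/sdb3; do
    umount "$dev"
    mke2fs -F -t ext4 -m 0 -O bigalloc,metadata_csum,sparse_super2 "$dev"
done

# Assumes fstab references the device paths rather than UUIDs; otherwise
# fstab needs updating first, since re-mkfs changes the filesystem UUIDs.
mount /srv/sda3
mount /srv/sdb3
systemctl start varnish.service
pool                                     # repool once varnish-be is healthy

# A reboot is still needed at some point to pick up scsi_mod.use_blk_mq.
```
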