[08:44:12] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) Google gets updates from us more than once a day; I don't know how their updat...
[10:41:45] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp2006.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807...
[11:07:36] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2006.codfw.wmnet'] ``` and were **ALL** successful.
[11:08:55] reimage of cp2006 as stretch smooth and easy!
[11:09:53] ema: pro-tip, if you need to reimage hosts in codfw it's slightly quicker from sarin ;)
[11:11:03] ema: \o/
[11:11:19] ema: is the reimage complete? I need to upgrade cumin on neodymium, and don't want to disrupt you
[11:12:47] seems so, proceeding as it takes few seconds
[11:15:14] volans: it is, yes
[11:15:23] ack, already done, all back yours
[11:27:39] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) There's conflicting information about how Google updates their index. On the one...
[11:28:53] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) This isn't about search engine optimisation in the strictest sense, but...
[11:41:15] vgutierrez: on cp2006 I had to manually upgrade librdkafka1 to 0.11.5-1~bpo9+1 in order for varnishkafka to start properly, I guess we should ensure that happens automatically
[11:41:35] host repooled
[11:42:12] in the sense that it's not installed from backports by puppet?
[11:42:20] volans: correct
[11:42:42] ema: it was rejecting the new config parameters?
[11:42:46] ok
[11:42:49] vgutierrez: also correct :)
[11:43:03] ack
[11:47:20] that's as expected, stretch has a native librdkafka1, so that's picked unless pinning is used
[11:47:35] yey, I was looking for the proper place to do the pinning
[11:50:41] elukey: FYI cp2006 (cache_misc) is now running stretch, varnishkafka seems to be working fine
[11:52:06] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @ArielGlenn @Deskana AFAICT, they're using ChangeProp to update the inf...
[12:00:44] elukey: my definition of vk working fine is "I see cp2006's messages on kafka-jumbo1001 webrequest_misc topic" though, which might not be yours :)
[12:00:54] ema: this should do the trick: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449163/
[12:06:53] vgutierrez: ship it!
[12:44:50] ema: thanks!
[12:45:12] vgutierrez: I am wondering if the pinning would be best in the profile
[12:45:53] elukey: hmmm for me it makes sense to have it as near as possible to the package installation
[12:47:07] and in the profile could look weird...
ie: webrequests having the pinning and eventlogging nope
[12:47:18] effectively would affect eventlogging as well
[12:48:43] ah sorry it is already merged
[12:48:48] all right then, +1 :D
[12:48:51] hahahah
[12:48:52] nevermind :D
[12:49:03] it is not written in stone :)
[12:49:33] half of the traffic team and Moritz are ok with the change => ok for me too :D
[12:49:44] what I just told you are the reasons why I went for that location
[12:49:53] yep yep it makes sense
[12:50:07] thanks!
[12:51:51] so one weird thing is https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=14&fullscreen&orgId=1&from=now-6h&to=now
[12:52:09] if I select cp2006, I see nothing for the past hour and a half
[12:52:35] is it depooled?
[12:54:11] elukey: it is
[12:54:24] elukey: sorry, it is not depooled (it's pooled) :)
[12:56:03] really strange, logster seems working
[12:56:35] even if we have
[12:56:36] /usr/bin/logster -o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=varnishkafka.cp2006.webrequest.misc JsonLogster /var/cache/varnishkafka/webrequest.stats.json > /dev/null 2>&1
[12:56:52] so if it fails it might be difficult to know :D
[12:57:06] stats are in /var/cache/varnishkafka/webrequest.stats.json
[12:57:26] trying to execute logster manually
[12:57:48] logster: error: option -o: invalid choice: 'statsd' (choose from 'graphite', 'ganglia', 'stdout')
[12:57:51] looool
[12:58:32] yeah.. logster-0.0.1 is installed
[12:58:35] instead of 0.0.10
[12:58:36] ah so on cp2005 we have 0.0.10-2~jessie1, on cp2006 0.0.1-2
[12:58:39] exactly
[12:58:56] I assume is kinda the same issue we just solved for varnishkafka with librdkafka1
[12:59:26] 0.0.10-2~jessie1 seems something that we packaged though
[12:59:44] yep
[13:00:00] we should package it for stretch as well
[13:00:41] I'd say that it is just a matter of removing the ~jessie1, build it for stretch-wikimedia, upload and then install
[13:01:04] something like 0.0.10-3
[13:01:31] before you guys ask, I have a plan to add a prometheus agent for varnishkafka
[13:01:34] :D
[13:02:03] I've just copied logster 0.0.10-2 from jessie-wikimedia to stretch-wikimedia, it works :)
[13:03:48] so we leave the ~jessie1 thingy?
[13:06:21] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (10Vgutierrez)
[13:06:56] anyhow, I see datapoints now in grafana :)
[13:07:10] <3
[13:08:55] elukey: on https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster I see 0.0.11-1~trusty1 as the last packaged version
[13:09:57] elukey: just dropping ~jessie1 from the version number seems pointless, but perhaps you wanna release 0.0.11?
[13:13:38] ema: ah ok it might have been a one off build then, nevermind. 0.11 is not worth it imho, I'd prefer to concentrate on prometheus
[13:14:14] k
[13:39:31] !log reboot cp2006 for kernel update
[13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:00] carrying on with more cache upgrades to stretch
[14:21:31] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
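
For reference, a minimal sketch of the kind of apt pinning discussed above for librdkafka1 on stretch: a preferences.d snippet that makes apt prefer the backported build (0.11.5-1~bpo9+1) over the native stretch package. The file name, pin priority and archive name here are assumptions for illustration only; the actual change is the puppet patch at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449163/.

```
# Illustrative sketch only: prefer the backported librdkafka1 over stretch's
# native version so varnishkafka accepts the newer config parameters.
# File name, priority and archive name are assumptions, not the real puppet change.
cat > /etc/apt/preferences.d/librdkafka1.pref <<'EOF'
Package: librdkafka1
Pin: release a=stretch-backports
Pin-Priority: 1001
EOF

apt-get update
# Check that apt now picks the backport as the candidate version:
apt-cache policy librdkafka1
```
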
[14:22:48] ema: I think I'm gonna push some storage changes with stretch too (as in reformat the ext4 differently and use different mount opts)
[14:23:10] stuff found while working on cp1075 that helps with the older SSDs too
[14:23:35] nice
[14:23:46] I'll push it all in a bit, so the next one can have it from the get-go (we might have to depool the ones you've already hit and re-mkfs on them)
[14:24:11] bblack: sounds good, I've only upgraded 2006 so far, 2012 in progress (both misc)
[14:28:20] using stretch reinstalls is a convenient barrier anyways. technically my commit will make jessie re-installs fail temporarily, but IMHO at this point if anything needs reinstalling before conversion is done, we just install it as stretch instead
[14:28:50] definitely, yes
[14:37:15] doh, using the jessie installer considered a bad idea when trying to install stretch
[14:38:04] if you hold on a sec before restarting, we can squeeze in testing this mkfs on a traditional node toio
[14:38:07] *too
[14:38:20] sure
[14:38:29] my primary concern is whether the varnish.main1 -level sizing will still be appropriate, or will need tweaking to match
[14:39:10] (e.g. 360G may have to become 359G, or may be able to now be 362G, I have no idea, there's so much rounding error in related things)
[14:42:42] did you merge mine?
[14:43:07] well anyways it's merged and puppetized to installservers
[14:44:48] in our usual spirit of "turn on 2 seemingly-good things in prod without proper low-level isolated perf testing, because who has time?", the mke2fs/mount changes go to the stretch nodes, as does the scsi blk_mq stuff
[14:44:59] both seem pretty safe and pretty likely to help though :)
[14:45:13] (post-install reboot will net the blk_mq stuff)
[14:45:40] oh, the installers are not puppeted with the new stuff
[14:45:49] I was confused when I lost the "submit" race without checking :P
[14:46:19] bblack: I've puppet-merged the cp2012 pxe patch only
[14:47:23] puppeting install[12]002 now
[14:47:45] ok, will reimage 2012 once you're done then!
[14:47:54] [done]
[14:48:17] if the autoinstall fails, it will likely be because I did something stupid in late_command, we'll see! :)
[14:48:45] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ```
[14:48:58] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
[14:49:11] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ```
[14:49:45] mmh, starting a new reimage failed with "Signed cert on Puppet not found for hosts ['cp2012.codfw.wmnet'] and no_raise=False"
[14:51:19] same host?
[14:51:27] ofc, it was removed by the previous one
[14:51:38] add --no-verify --no-downtime
[14:51:44] volans++
[14:51:59] * volans guess a first reimage failed
[14:52:11] but if it failed in the middle we can restart from where it broke more or less
[14:52:14] depending on the failure
[14:52:17] * volans need more context
[14:52:18] volans: I've stopped it halfway through yeah
[14:52:39] and you want to redo all the d-i and pxe?
[14:52:48] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014...
[14:53:27] volans: yes, because (1) due to PEBKAC I've started the reimage with jessie instead of stretch and (2) we're trying different mkfs options
[14:53:34] akc
[14:53:35] ack
[14:56:18] d-i started
[14:59:35] bblack: mkfs phase went smoothly, installing base system now
[15:01:06] the mkfs in late_command is after the base system
[15:01:13] (not part of partman stuff)
[15:01:19] ah right!
[15:01:51] the command worked fine on cp1075, the concern is my refactoring to use a shell variable to reduce duplication could've caused some stupid sh error
[15:04:51] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) @Imarlier That's good info, thanks!
[15:09:42] root@cp2012:~# dumpe2fs /dev/sdb3 | grep features
[15:09:42] dumpe2fs 1.43.4 (31-Jan-2017)
[15:09:42] Filesystem features: resize_inode sparse_super2 filetype extent 64bit flex_bg sparse_super large_file huge_file bigalloc metadata_csum
[15:09:49] looks good ^
[15:10:06] nice
[15:10:26] fyi, ulsfo is still depooled for last week's links failures, I'll re-pool after the meeting
[15:10:42] ok
[15:11:23] ema: so the removal of data=writeback from fstab is already puppetized to all stretch caches. it won't fail on the old ones, they'll just not have the mkfs improvements
[15:12:00] I think we already had reasons to need a post-puppetization reboot on new installs, but in any case the scsi_mod param definitely needs one
[15:12:54] so cp2006, should just need to depool, stop varnish.service, unmount /srv/sd[ab]3, run the mkfs commands on sd[ab]3, remount, restart varnish
[15:13:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/449201/1/modules/install_server/files/autoinstall/scripts/late_command.sh
[15:14:05] and reboot for use_blk_mq I guess
[15:14:21] 2006 already picked it up, it was merged earlier than the FS stuff
[15:14:26] ah great
[15:19:08] so at this point cp1075 looks pretty good. it's depooled from cache_text but otherwise live-ish.
[15:19:13] bblack: fs size issues on 2012 it seems
[15:19:18] /dev/sdb3 ext4 364G 48M 356G 1% /srv/sdb3
[15:19:30] -s main2=file,/srv/sdb3/varnish.main2,360G
[15:19:41] ema: did it fail, or just fails to use all the space?
[15:19:55] bblack: varnish-be fails to start
[15:20:08] Jul 30 15:18:32 cp2012 varnish[7199]: Error: (-sfile) allocation error: No space left on device
[15:20:13] ok taking a look
[15:21:31] yeah the size came up slightly-smaller than I expected in the other case too, because of the 16M chunks allocation is done in
[15:23:55] yeah, so 355G works, previous was 360G
[15:24:06] but it seems odd the 16M allocation thing would cause that
[15:24:17] (5G difference for 16M alignment?)
[15:26:33] eh it's not worth messing with perhaps (or can be some later experiment)
[15:26:47] I'll patch to not vary that stuff on stretch, just on new-nvme hosts vs rest
[15:28:57] ok
[15:31:03] I'll clean up 2012 to match, 2006 should be fine as it is
[15:36:17] yeah
[15:38:08] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` and were **ALL** successful.
[15:38:11] bblack: cool, reimage concluded properly now (it was waiting for puppet)
[15:41:07] vgutierrez: FYI on cp2012 librdkafka1's pinning worked but the stretch version wasn't automatically installed, probably some other package depending on librdkafka1 installed it before the pinning took effect
[15:41:32] vgutierrez: I'm upgrading it by hand now and restarting varnishkafka but we should look into this later (or tomorrow!)
[15:42:12] all cleaned up
[15:42:41] so, cp1075 is basically in a seemingly-good state, just depooled from text@eqiad as its 9th member
[15:42:48] nice
[15:43:35] I figure next step on that front is pool it up as #9 and see how it flies, then we can plot out the rest of that hardware set
[15:43:48] but maybe we get past doing some cache_text stretch upgrades on existing nodes first
[15:44:04] yeah, +1
[15:44:31] for now it can sit and burn in, maybe we find a kernel crash while we're waiting or whatever :)
[15:44:52] :)
[15:47:19] I'll install the others in the meantime, at least the base OSes
[15:47:44] I don't remember if adding a bunch of depooled nodes at the pybal or varnish levels perturbs (c)hashing pointlessly at one of the layers
[15:48:42] if chashing does its job, it shouldn't
[15:50:09] !log reboot cp2012 for kernel update
[15:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:48] 10Traffic, 10DNS, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson)
[16:56:32] 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Krenair) Some of them are so similar to each other I believe we can do a huge merge: https://gerrit.wikimedia.org/r/#/c...
[17:59:46] bblack: so far, so good: https://puppetboard.wikimedia.org/fact/default_routes Will wait ~2h then look at pushing the mtu change?
[18:02:57] XioNoX: yeah, the ulsfo one
[18:03:38] yep
[23:09:03] do we have stats on how many requests we get on HTTP? I'm curious about how many people are running either an outdated HSTS preload list, or don't have one at all
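
As a footnote to the cp2006 follow-up earlier in the log: a minimal sketch of the per-host conversion bblack outlines at 15:12:54 (depool, stop varnish, re-mkfs /srv/sd[ab]3, remount, restart, plus a reboot at some point for use_blk_mq). The mke2fs options and the pool/depool wrappers shown here are assumptions for illustration; the authoritative mkfs invocation is the one puppetized in late_command.sh (the gerrit 449201 change linked above).

```
# Illustrative sketch only -- the exact mke2fs flags live in the puppetized
# late_command.sh; the feature list below just mirrors the dumpe2fs output
# pasted from cp2012 (bigalloc, metadata_csum, sparse_super2, ...).
depool                                   # assumes the standard conftool depool wrapper
systemctl stop varnish.service

for dev in /dev/sda3 /dev/sdb3; do
    umount "$dev"
    mke2fs -F -t ext4 -m 0 -O bigalloc,metadata_csum,sparse_super2 "$dev"
done

# Assumes fstab references the device paths rather than UUIDs; otherwise
# fstab needs updating first, since re-mkfs changes the filesystem UUIDs.
mount /srv/sda3
mount /srv/sdb3
systemctl start varnish.service
pool                                     # repool once varnish-be is healthy

# A reboot is still needed at some point to pick up scsi_mod.use_blk_mq.
```
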