[00:12:01] Traffic, Operations, Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (Varnent) Open>Resolved a:Varnent Thank you @BBlack for all your help today!
[00:41:11] bblack: do you know about ^ (stats on HTTP traffic)
[02:00:31] Traffic, Operations, Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (Varnent) Here is a ticket for the redirects in general on the new site: [T200754]
[02:09:29] Traffic, DNS, Operations, Release-Engineering-Team, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent) Task with info on redirects: [T200754]
[05:05:12] Traffic, DNS, Operations, Release-Engineering-Team, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent)
[06:49:20] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2018.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073106...
[07:15:15] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2018.codfw.wmnet'] ``` and were **ALL** successful.
[07:17:46] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2025.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073107...
[07:18:51] ema: if the reimages are smooth enough and you need to do a lot of them, you could use the wrapper script to do more than one at a time (you can put sequential and also a sleep in between as needed)
[07:33:12] volans: nice, yeah, so far I'm just finishing up misc-codfw, then I'll use the wrapper!
[07:43:43] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2025.codfw.wmnet'] ``` and were **ALL** successful.
[07:50:50] !log reboot cp2025 for kernel updates
[07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:21] !log reboot cp2018 for kernel updates
[07:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:26] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3008.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073109...
[09:46:20] !log reboot cp1045 (jessie) for kernel updates
[09:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:34] elukey, vgutierrez: cp3008, just upgraded, did pick up the right librdkafka1 version
[10:13:02] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3008.esams.wmnet'] ``` and were **ALL** successful.
[10:13:30] so stretch reimages do not require any manual step anymore, except for double-checking that things are fine and repooling
[10:15:15] nice!
[10:19:54] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3007.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073110...
[10:25:45] Wikimedia-Apache-configuration, Operations, Patch-For-Review, User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (Joe) @Krenair I think I will just reproduce the patches I did to the mediawiki_test environment in the main one, that l...
[10:51:46] ema: awesome
[10:52:53] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3007.esams.wmnet'] ``` and were **ALL** successful.
[11:00:25] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3010.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073110...
[11:31:58] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3010.esams.wmnet'] ``` and were **ALL** successful.
[13:04:07] ok I'm gonna try to upgrade two eqiad-misc hosts in parallel
[13:04:21] * ema feels adventurous
[13:06:05] go for it :D
[13:06:34] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1045.eqiad.wmnet', 'cp1051.eqiad.wmnet'] ``` The log can be found in `/var/l...
[13:33:40] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1045.eqiad.wmnet', 'cp1051.eqiad.wmnet'] ``` and were **ALL** successful.
[13:33:44] \o/
[13:41:23] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1058.eqiad.wmnet', 'cp1061.eqiad.wmnet'] ``` The log can be found in `/var/l...
[13:44:06] aes128-sha usage around 0,046% today <3
[13:45:51] hm
[13:46:09] should I make stretch cache hosts in deployment-prep?
[13:46:44] hi Krenair!
[13:46:58] Krenair: btw, could you take a look into https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/446618/ ?
[13:47:24] Krenair: is not finished yet (some tests missing), but the main code should be, so I'd love to see your input there
[13:47:26] Krenair: yes, reimaging as stretch should Just Work
[13:47:35] vgutierrez, alright, I'll take a look tomorrow
[13:47:38] thx :D
[13:47:49] ema, uh well we can't reimage labs hosts
[13:48:02] we just replace them with new ones
[13:48:11] Krenair: yeah I meant destroying and recreating the VM with a stretch image
[13:50:05] Krenair: we do have traffic-text-stretch.traffic.eqiad.wmflabs in labs which works
[14:05:57] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1058.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1058.eqiad.wmnet'] ```
[14:13:58] meh
[14:14:47] what failed?
[14:16:42] volans: cp1058's remove_from_puppet
[14:17:02] wasn't in it?
let me check
[14:17:18] volans: https://phabricator.wikimedia.org/P7404
[14:19:43] wut
[14:19:44] Error: header too long
[14:19:52] Error: Try 'puppet help node clean' for usage
[14:19:55] uh?
[14:20:01] less -R 201807311341_ema_22491_cp1058_eqiad_wmnet_cumin.out
[14:20:03] at the end
[14:21:34] volans: do you think we could disable all fancy colors if redirecting to a file? :)
[14:22:22] that's partially a bug with tqdm and a little bit how cumin uses it unfortunately, the progress bar should not go to the file at all
[14:22:34] but thanks to ge.hel patch to decouple the progress bar
[14:22:41] we're getting close
[14:22:51] nice
[14:22:55] and yes the plan is to be able to run cumin as a lib without that
[14:24:32] ema are you re-trying the reimage or you want me to debug the deactivate by itself?
[14:25:04] volans: I'll try once more
[14:26:31] lol, when you look for puppet errors you always find amusing things
[14:26:32] https://serverfault.com/questions/627361/puppet-cert-list-all-error-header-too-long
[14:26:33] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1058.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20...
[14:32:26] volans: 14:26:17 | cp1058.eqiad.wmnet | Removed from Puppet
[14:32:45] so all good this time?
[14:32:53] so it seems
[14:33:03] * volans wonders if can be some races with puppet merge or anything else
[14:33:13] you did 2 hosts at the same time right?
[14:33:17] correct
[14:33:34] so I guess there was another request for the other host too
[14:33:48] and all of a sudden that stackoverflow link start making some sense
[14:34:42] although this was removing from puppet and not signing a new request
[14:34:51] I bet there are many races in the puppet CA managing of certs
[14:35:10] yeah but that error message seems to affect lots of operations, not only signing requests
[14:36:14] very informative message btw, "header too long"
[14:37:26] yeah
[14:39:59] I like the "captain speaking" messages from -auto-reimage very much
[14:40:08] Started first puppet run (sit back, relax, and enjoy the wait)
[14:45:30] :-)
[14:50:57] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1058.eqiad.wmnet'] ``` and were **ALL** successful.
[14:53:33] ok then, 12 cache hosts running stretch (cache_misc), only 81 to go!
[14:55:27] \o/
[15:01:12] let's upgrade a text and an upload node and see what breaks
[15:10:18] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3031.esams.wmnet', 'cp3044.esams.wmnet'] ``` The log can be found in `/var/l...
[15:15:48] no race this time
[15:15:48] 15:10:15 | cp3044.esams.wmnet | Removed from Puppet
[15:15:48] 15:10:15 | cp3031.esams.wmnet | Removed from Puppet
[15:17:09] was the same second the previous time too?
[15:22:20] ema: are you going to upgrade them all today?
[15:22:46] XioNoX: nope, beer o'clock is approaching
[15:23:18] nice!
[15:23:29] ema: what's the timeline for that?
[15:23:39] (upgrade, not beer) :)
[15:24:00] volans: according to the logs it wasn't, cp1061 was removed successfully at 13:41:20, while the removal of cp1058 failed at 13:41:16
[15:24:19] ema.
ack so still mysterious
[15:25:46] XioNoX: no clear timeline, but assuming no issues a couple of weeks
[15:27:26] ema: ok! asking as I started to push the changes for T195365 in ulsfo yesterday and will probably do codfw next, I don't expect any compatibility issues, but not doing the two at once we can ensure one change fixed the issue and not the other
[15:28:38] XioNoX: ok I will refrain from upgrading other nodes in ulsfo/codfw till you give me the green light
[15:29:27] XioNoX: or if you think it's better to pause the stretch upgrades altogether till T195365 is fully deployed let me know
[15:30:27] ema: nah, don't worry, will probably do codfw today or tomorrow, and 24h will be enough to know if things are better/worse/same
[15:30:40] cool
[15:35:15] 15:32:00 | cp3031.esams.wmnet | Still waiting for reboot after 15.0 minutes
[15:35:23] no bueno, power cycling
[15:36:07] !log power-cycle cp3031, stuck rebooting
[15:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:17] UEFI0060: Power required by the system exceeds the power supplied by the Power
[15:40:20] Supply Units (PSUs).
[15:40:22] Check the PSU and system configuration, and then upgrade the PSU, if necessary.
[15:40:40] way to go, kick the rebellious nodes while nobody's there to keep them in line :)
[15:45:26] Traffic, Operations: cp3033: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (ema)
[15:45:30] Traffic, Operations: cp3033: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (ema) p:Triage>Normal
[15:46:07] Traffic, Operations: cp3031: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (ema)
[15:58:54] volans: 15:58:42 | cp3031.esams.wmnet | Still waiting for reboot after 40.0 minutes
[15:59:00] will it ever give up? :)
[15:59:15] yes, eventually
[15:59:23] I kept the timeout pretty large
[15:59:30] to allow to fix manually the issue in the meanwhile
[15:59:35] and let the script continue
[15:59:57] is this before d-i, the first reboot into PXE?
[16:00:54] ema: ^^^
[16:01:06] volans: nope, apparently it did reach the d-i phase
[16:01:13] 15:10:17 | cp3031.esams.wmnet | Set Boot Device to pxe
[16:01:17] [...]
[16:01:21] 15:16:39 | cp3031.esams.wmnet | Host up (Debian installer)
[16:01:30] ok, so this is waiting the d-i
[16:01:47] it might take long time, mw hosts take 1h10m
[16:02:00] volans: nope, T200806
[16:02:01] T200806: cp3031: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806
[16:02:26] ah
[16:02:32] so it's broken somehow?
[16:02:47] did you try ssh into mgmt and check the console?
[16:02:48] it is dead, Jim
[16:02:51] :)
[16:03:17] volans: that sad message about PSUs is what the console said
[16:03:31] and I guess you did try to reboot it
[16:03:59] yes even trying to switch it off and on again didn't work
[16:04:16] but it did work the first time into d-i...
nice
[16:05:27] btw the timeout is 1h but you can ctrl+c anytime if you want to stop the script
[16:07:10] nah, I like to see -auto-reimage trying its best
[16:08:05] ahahah but it's not doing anything right now, just polling
[16:09:24] there's something romantic in this desperate attempt
[16:09:49] ema: if you're still around https://gerrit.wikimedia.org/r/#/c/449484/
[16:10:12] (yes, I shamelessly copied your commit message)
[16:11:03] XioNoX: +1 for the commit message
[16:11:20] aahahah
[16:11:51] ok so cp3044 (upload) is up and running with stretch
[16:17:09] not pooling it for now
[16:17:50] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3031.esams.wmnet'] ``` Of which those **FAILED**: ``` ['cp3031.esams.wmnet'] ```
[16:18:22] ah there you go, it gave up
[16:50:17] Traffic, Discovery, Maps, Maps-Sprint, and 3 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848 (Mholloway)
[17:18:00] Traffic, DNS, Operations, Release-Engineering-Team, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent) @BBlack and @Reedy - One of the places that did not seem to respect the temporary nature of that U...
[17:27:19] Traffic, DNS, Operations, WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (greg)
[17:43:11] Traffic, DNS, Operations, WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (BBlack) They may have cached it during the brief time it was a 301 rather than a 302 in the changes above, rather...
[17:50:46] Traffic, DNS, Operations, WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent) Thank you for the quick response! I am open to ideas. In IRC it was suggested that we make some updates...
[18:21:41] Traffic, Operations, Wikimedia-General-or-Unknown, I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782 (Varnent) Open>declined No longer applies to new site.
[19:22:28] netops, Cloud-VPS, Operations, cloud-services-team, ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (ayounsi)
[19:22:45] netops, Cloud-VPS, Operations, cloud-services-team, ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (ayounsi)
[19:42:12] Traffic, netops, Operations, Patch-For-Review: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 (ayounsi)
[20:16:20] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) I've opened case 977580870 to coordinate getting a Dell Tech dispatched to eqsin with a replacement part.
[22:54:16] netops, Operations, ops-esams: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (ayounsi)
[22:54:19] Traffic, netops, Operations, ops-ulsfo, Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (ayounsi)
[22:54:22] Traffic, netops, Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi)