[02:49:17] hmm it looks like cp5037 went AWOL during the APAC night
[02:49:21] *cp3057
[07:33:08] hi!
[07:33:28] I am seeing cp5012 down, under maintenance?
[07:34:08] ah it seems moving to ATS
[07:34:41] that failed, will ack with T227432 (cc: ema)
[07:34:42] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[07:46:06] elukey: yeah... we're experiencing some issues reimaging cp5012
[08:03:32] vgutierrez: buenos dias
[08:03:39] morning :)
[08:04:14] vgutierrez: I was expecting to wake up to fifo-log-demux 0.6 in the archive :)
[08:04:20] ouch
[08:04:24] yeah, let me release that
[08:04:28] thanks
[08:05:11] I've uploaded trafficserver wm10 so we should test both on one host and then upgrade everything I guess
[08:06:23] vgutierrez: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/548581/ LGTM but maybe let's wait for godog before merging
[08:06:39] sure
[08:07:08] hmm it looks like I've never released fifo-log-demux
[08:09:06] (building it right now)
[08:14:36] ema: fifo-log-demux 0.6 uploaded
[08:18:31] nice!
[08:24:56] elukey: https://phabricator.wikimedia.org/T237360
[08:25:10] here's the mystery of the day
[08:26:17] good luck :)
[08:27:57] I don't need luck, I need a rescue shell
[08:59:28] greetings
[08:59:35] ema: will take a look shortly
[09:00:04] in the meantime I'll nag again here re: expiring certificates warnings on icinga, what can we do about those ?
[09:09:59] godog: we are aware of the noise, but we're still waiting for the new globalsign certificate AFAIK
[09:12:10] I thought we ack'ed them all though
[09:12:14] why are they back?
[09:12:36] the downtime already expired I think
[09:12:49] ah
[09:14:33] yeah that's what happened, ok to just ack them all? I'm assuming the issue is on your radar regardless of alerts
[09:15:03] godog: yeah, let me do that
[09:15:21] ema: thanks!
[09:15:38] meanwhile, I think the phab bot stopped working as usual, and I was trying (and failing) to find the wikitech article for how to restart it
[09:15:55] let me know if you find it while a click around :)
[09:16:09] s/a click/I click/
[09:16:22] looks like to me stashbot is working, e.g.
[09:16:26] https://phabricator.wikimedia.org/T236924#5634683
[09:16:48] yup.. but I've created some tasks and we didn't get the usual message here
[09:17:02] like T237348
[09:17:03] T237348: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348
[09:17:44] so wikibugs is the misbehaving bot
[09:17:56] ah! my bad
[09:18:20] it's still online but I guess it has some existential issues
[09:18:39] wikibugs: halp
[09:19:14] and yet people are afraid that AI will steal our jobs
[09:19:34] so far it looks like it creates plenty of work for us!
[09:24:36] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) p:Triage→Normal
[09:25:58] <3
[09:26:05] it's back!
[09:26:36] ema: BTW... regarding cp3057, anything else to check besides system logs and the idrac logs?
[09:27:08] apparently the poor bastard died without logging anything anywhere
[09:30:56] vgutierrez: we did have some similarly suspicious issues on different cache nodes that happened only one single time
[09:31:13] so I'd apply the usual approach: ignore if it happens once, debug the second time?
[09:31:54] hard to debug when there is no evidence, but yeah
[09:31:59] right now it's behaving as expected
[09:34:03] there are a few cases where microcode isn't correctly applied despite intel-microcode installed, the kernel recent enough and the CPU supported by Intel
[09:34:51] confusingly there's also a genuine hardware bug which prevents microcode loading sometimes, but when adding an Icinga check to catch such regressions, these were noticed
[09:35:13] maybe these are also caused by initrd issues of some sort
[09:43:58] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) As an update, cp5012 is currently reimaging (`Started first puppet run` phase). The initramfs looks like this right now: ` root@cp5012:~# ls -l /boot/...
[09:45:29] ema: so... it looks like it was a one-time glitch during d-i?
[09:45:50] two times (that's why I'm debugging it now) :)
[09:46:10] it failed once, I've reimaged it again, it failed twice
[09:46:20] this is the third reimage
[09:46:40] /o\
[09:49:54] and this time it worked
[09:52:07] what failure ?
[09:52:44] volans: T237360
[09:52:44] T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360
[09:54:53] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` and were **ALL** successful.
[09:56:10] did you see any packet loss by any chance between the cp host and the install server? I'm wondering if the transfer of the image got corrupted
[09:56:11] godog: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Prometheus::Class_config[trafficserver_backend_text_esams]: has no parameter named 'cluster' (file: /etc/puppet/modules/profile/manifests/prometheus/ops.pp, line: 292) on node bast3004.wikimedia.org
[09:56:15] just a wild guess
[09:56:22] godog: it looks like I broke puppet
[09:56:24] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) This time, after reimaging the host it did boot properly. Also, initramfs size is now in line with that of other cp5 systems: ` cp5010.eqsin.wmnet: -r...
[09:57:17] godog: I guess I need to change to cluster_config after all :)
[09:57:37] volans: the initrd is generated by the system, not copied over the network
[09:58:01] by /usr/sbin/update-initramfs --something
[09:58:30] this is after d-i?
[09:59:04] yes, the first reboot after d-i
[09:59:08] sorry I thought it was d-i failing to boot, my bad, read too quickly :)
[09:59:28] np, thanks for checking!
[10:09:35] vgutierrez: hah! indeed, I missed the fact that there was no PCC in the review heh
[10:09:42] my bad
[10:09:57] godog: fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/548702
[10:10:17] pcc shows a NOOP cause production is currently broken
[10:10:24] so it's unable to provide a DIFF
[10:10:29] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/548702
[10:11:16] vgutierrez: ack, IIRC 'cluster' is added automatically to labels, you can skip it
[10:12:09] hmm other definitions using cluster_config add it verbosely as well
[10:12:20] but let me check
[10:12:46] also I think it's useful to be explicit about it, for future refactors and so on
[10:13:44] fair enough! +1'd
[10:22:48] puppet is now happy on bast3004
[10:23:24] godog: one question, how do we get rid of the old definitions? I've seen the creation of the new 4 File resources but it doesn't look like the old ones have been purged
[10:25:11] and the yaml files are still there of course
[10:25:54] vgutierrez: the easy solution is cumin + rm, puppet doesn't recursively manage the directory
[10:26:08] ack, will do
[10:30:50] godog: do I need to ping prometheus after cleaning the old files?
[10:31:20] vgutierrez: no it'll pick up changes by itself
[10:31:25] lovely
[10:31:59] hmmm interesting, it looks like the old-old /srv/prometheus/ops/targets/trafficserver_eqsin.yaml is still there
[10:32:30] at least on eqsin where we ran the first TS instances
[10:44:38] ema: Oct 25 08:21:59 cp5006 atsmtail-backend[13805]: E1025 08:21:59.488841 13827 log_watcher.go:190] fsnotify error: fsnotify queue overflow
[10:44:43] ema: this one is new, at least for me
[10:45:02] !log restarting atsmtail@backend on cp5006
[10:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:16] vgutierrez: yup, never seen that either
[10:57:44] BTW, I've detected it by tailing the prometheus server.log on bast5001
[10:57:56] godog: do we have any kind of check looking for errors there?
[10:59:23] vgutierrez: not ATM, what's the prometheus log line ?
[10:59:38] Nov 5 10:44:24 bast5001 prometheus@ops[4732]: level=warn ts=2019-11-05T10:44:24.525297041Z caller=scrape.go:1094 component="scrape manager" scrape_pool=trafficserver-upload target=http://cp5006:3904/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=74
[11:00:04] maybe getting a metric of dropped metrics would be enough
[11:00:46] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[11:02:04] alright! ulsfo and eqsin are now fully converted to ats-be, both for upload and text
[11:02:11] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/548712 so these are the basic metrics needed to provide ats-tls visibility in https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1
[11:02:26] and of course we need the global ones as well
[11:02:46] but as soon as we get ats-tls to perform varnish-fe duties... those metrics are going to be polluted by PURGE requests :/
[11:04:08] vgutierrez: looks good!
[11:04:21] yup, but I'm worried about the PURGE thingie
[11:04:44] yeah, we might want to patch ATS for that
[11:04:49] I know.. that's a future vgutierrez issue..
[11:06:00] yup... it looks like we should get new metrics on ATS itself to fix it
[11:24:02] vgutierrez: aye there should be a metric already I think on dropped data points, maybe no alert though
[11:24:18] I seem to remember a similar mtail error in the past, checking
[11:30:21] looks like we can disable fsnotify entirely, this is stdin anyways, https://github.com/google/mtail/issues/195
[11:30:49] vgutierrez: mind opening a task for this issue? and cc observability ?
[11:32:47] no problem. tomorrow APAC morning though
[11:32:55] I'm on my way to get dinner
[11:34:03] vgutierrez: enjoy!
[12:39:49] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) >>! In T233661#5634559, @elukey wrote: >>>! In T233661#5632172, @BBlack wrote: >> Agreed, let's not go down that road right here...
[12:41:09] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) >>! In T233661#5633768, @Nuria wrote: > @BBlack: once we deploy the VCL/varnish-kafka changes we need to change our refine pipel...
[15:19:48] ema: any pending reimage left?
[15:20:16] volans: not today, many to come
[15:20:26] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (elukey) >>! In T233661#5635537, @BBlack wrote: >>>! In T233661#5634559, @elukey wrote: >>>>! In T233661#5632172, @BBlack wrote: >>> Agree...
[15:20:42] if you have a "safe" one, could I use it for the provisioning onboarding chat?
[15:20:51] of course
[15:20:52] I would run the reimage during the chat
[15:21:04] we're gonna go for a cache text host in esams :)
[15:21:14] wfm :)
[15:35:37] Traffic, Operations: ats-tls-restart failed on cp4027 - https://phabricator.wikimedia.org/T237425 (ema)
[15:37:06] volans: when is that?
[15:37:56] trying to set it up, at this point I think most likely Fri., I can't Thu. and tomorrow there are a bunch of meetings for the people involved
[15:38:28] ok keep me updated!
[15:41:54] Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: touch: cannot touch '/srv/trafficserver/tls/etc/ssl_multicert.config': Read-only file system
[15:42:00] ^ that's a rather odd message...
[15:44:25] something to do with systemd namespacing/security/binding ?
[15:44:31] the real rootfs was never ro
[15:44:42] volans: you could also reimage one of cp1071-1074, those are spares waiting for decom, but they have the new hiera flag enabled which manages adduser.conf with some revised settings and seeing the effect of that change on a freshly installed system would also be interesting
[15:45:38] yeah we have a number of ex-traffic nodes that pre-date the poweroff/boot-wipe thing, sitting in spares somewhere
[15:45:41] moritzm: any host would do, happy to pick one of those
[15:45:48] (and also a few intentional spares that we do want to keep warm, need to sort those out)
[15:46:41] bblack: right, we have ProtectSystem=strict in the unit
[15:46:56] so everything is read-only except for the directories we whitelist with ReadWritePaths
[15:47:52] volans: actually, Alex is testing stuff on cp1071, so only 1072-1074 :-)
[15:48:42] ack :)
[16:00:08] just so long as the plan isn't to keep the same private key forever because there's no sane/easy way to ever change private keys :)
[16:00:40] oh don't worry, that's never _the plan_ ;)
[16:00:41] (there should be. a PKI system really isn't much of a PKI system if there's no way to handle a smooth overlapped transition of authority)
[16:20:13] bblack: so if you have a host that I can reimage and then immediately decommission that would be perfect for the onboarding chat :)
[16:21:46] you can do that to any of the unused cp1071-4
[16:27:22] akosiaris runs some tests on 1071 currently
[16:28:11] please don't kill my etcd vagrant VMs on cp1071
[17:43:19] Traffic, Operations, codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482 (BBlack) Open→Declined We're not going down this road at all. `cache::route_table` will just go away when all cache backends have converted to ATS in T227432, which doesn'...
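A note on the `Read-only file system` error discussed at 15:41-15:46: with `ProtectSystem=strict`, systemd mounts the whole file system hierarchy read-only for the unit, apart from the API file systems `/dev`, `/proc` and `/sys`, and `ReadWritePaths=` re-opens specific subtrees. The sketch below models only that prefix rule. The function names and the example `ReadWritePaths` values are hypothetical; the real unit's whitelist is not shown in this log, and the sketch ignores the `-`/`+` path prefixes and other sandboxing options.

```python
from pathlib import PurePosixPath
from typing import Iterable

# API file systems that remain writable under ProtectSystem=strict.
ALWAYS_WRITABLE = ("/dev", "/proc", "/sys")


def is_under(path: PurePosixPath, root: PurePosixPath) -> bool:
    """Return True if `path` equals `root` or lives somewhere below it."""
    return path == root or root in path.parents


def writable_under_strict(path: str, read_write_paths: Iterable[str]) -> bool:
    """Approximate whether `path` is writable for a unit that sets
    ProtectSystem=strict plus the given ReadWritePaths= entries."""
    target = PurePosixPath(path)
    roots = [PurePosixPath(p) for p in (*ALWAYS_WRITABLE, *read_write_paths)]
    return any(is_under(target, root) for root in roots)


# Path taken from the update-ocsp-all error in the log.
MULTICERT = "/srv/trafficserver/tls/etc/ssl_multicert.config"

# Hypothetical whitelist that does not cover the file: the touch would fail.
print(writable_under_strict(MULTICERT, ["/var/cache/ocsp"]))

# Hypothetical whitelist extended with the containing directory.
print(writable_under_strict(MULTICERT, ["/var/cache/ocsp", "/srv/trafficserver/tls/etc"]))
```

Under these assumptions the first call reports the file as read-only and the second as writable, which matches the explanation given at 15:46 that only whitelisted directories can be written.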
[17:49:12] Traffic, Operations: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181 (BBlack) Open→Resolved a:BBlack Yes, this task was long-ago completed. See also https://phabricator.wikimedia.org/phame/post/view/111/wikipedia_goes_100_forward_secret/
[17:52:35] HTTPS, Traffic, Operations: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (BBlack) @MBeat33 / @Jseddon - Any update yet?
[17:56:07] Traffic, Operations: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 (BBlack) Open→Resolved a:ayounsi
[17:56:11] Traffic, netops, Operations, Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (BBlack)
[17:58:57] Traffic, Operations: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525 (BBlack) Open→Declined I don't think we'll go the LVS route here.
[17:59:00] Traffic, Operations: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (BBlack)
[17:59:04] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (BBlack)
[18:06:33] Traffic, Operations, Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (BBlack)
[18:06:36] Traffic, netops, Operations, Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (BBlack)
[18:06:39] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (BBlack)
[18:07:59] Traffic, Operations, Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (BBlack) Open→Resolved a:BBlack With anycast recdns deployed at all sites with fallback routing towards the cores (or to the opposite core, as the case may be),...
[18:08:07] Traffic, Operations: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (BBlack) I think this is actually fairly orthogonal to some of the other improvements. Not sure what current/modern thinking is on this either, probably needs re-evaluation. My gut feeling it...
[18:08:49] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack)
[18:08:56] Traffic, Operations, Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (BBlack)
[18:08:59] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack)
[18:09:18] sorry for the spam, doing some low-hanging ticket cleanup / triage stuff
[19:20:29] Wikimedia-Apache-configuration, User-Dereckson, patch-welcome: svn.wikimedia.org/doc/ should redirect to doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Krinkle)
[19:20:37] Wikimedia-Apache-configuration, MediaWiki-Documentation, User-Dereckson, patch-welcome: svn.wikimedia.org/doc/ should redirect to doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Krinkle)
[19:22:06] Wikimedia-Apache-configuration, MediaWiki-Documentation, User-Dereckson, patch-welcome: svn.wikimedia.org/doc/ should redirect to doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Krinkle) Just fixed another one of these ([edit](https://www.mediawiki.org/w/index.php?title=R...
[19:43:14] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) @BBlack the current proposal is: - On the netbox host(s) there will be a script to generate the snippet files that will perform some...
[19:43:23] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) p:Triage→Normal
[21:21:18] Traffic, Operations, decommission, ops-eqiad: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-4/0/25 { - description radon; - }
[21:21:58] Traffic, Operations, decommission, ops-eqiad: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul)
[21:27:12] Traffic, Operations, decommission, ops-eqiad, Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul)
[21:27:49] Traffic, Operations, decommission, ops-eqiad, Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul) Open→Resolved Complete
[23:55:28] Traffic, Gerrit, Operations, Patch-For-Review: Switch on http/2 in apache for gerrit - https://phabricator.wikimedia.org/T180978 (Dzahn)
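Regarding the 10:59-11:00 suggestion of "a metric of dropped metrics": the Prometheus warning quoted at 10:59:38 carries its fields as `key=value` pairs, so the dropped-sample count can be tallied per target straight from the log. This is only a rough sketch of that idea, not how the channel resolved it. The function names are hypothetical, the parser handles just the simple quoting seen in that one line (it does not unescape backslashes), and a counter exported by Prometheus itself would likely be the more robust source.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# key=value pairs where the value is either a double-quoted string or a bare token.
LOGFMT_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> Dict[str, str]:
    """Extract key=value fields from a log line, dropping surrounding quotes."""
    fields = {}
    for key, value in LOGFMT_PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def dropped_samples_by_target(lines: Iterable[str]) -> Counter:
    """Sum num_dropped per scrape target over the given log lines."""
    totals = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if "num_dropped" in fields and "target" in fields:
            totals[fields["target"]] += int(fields["num_dropped"])
    return totals


# The warning pasted in the channel at 10:59:38.
SAMPLE = (
    'Nov 5 10:44:24 bast5001 prometheus@ops[4732]: level=warn '
    'ts=2019-11-05T10:44:24.525297041Z caller=scrape.go:1094 '
    'component="scrape manager" scrape_pool=trafficserver-upload '
    'target=http://cp5006:3904/metrics '
    'msg="Error on ingesting samples that are too old or are too far into the future" '
    'num_dropped=74'
)

print(dropped_samples_by_target([SAMPLE]))
```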