[05:58:11] netops, Operations: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi)
[05:58:49] netops, Operations: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi)
[06:44:51] netops, Operations: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) p:Triage→High
[07:02:13] netops, Operations: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) Opened JTAC case 2020-0601-0882, at this point it's too much of a coincidence to not think of a backplane issue.
[08:13:12] Traffic, Operations: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 (ema) Open→Resolved a:ema This is now done: atskafka uses [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/prometheus-rdkafka-exporter | prometheus-rdkafka-expo...
[12:04:53] XioNoX: if cr1-codfw failed completely, would we leave codfw pooled with just one cr?
[12:16:38] cdanis: I hesitated but cr2 is not showing any issues, the new fpc0 should arrive this morning codfw time, and hopefully JTAC should have got back to me about fpc5 by then
[12:17:01] 👍
[12:17:15] yeah, I can imagine arguments either way, but that's all reassuring
[12:17:44] if they say it's a backplane issue though we will depool it for the replacement
[12:17:55] and maybe use that as an opportunity to upgrade the routers
[12:18:29] it's all designed to be able to run on a single router, right
[12:18:58] that should be possible, so if/when it's not, that would be good to know
[12:19:18] sure, I wasn't worried from a capacity perspective really -- just that there's a difference between "N+0 for a few hours" vs "N+0 indefinitely"
[12:19:44] fair
[12:24:20] re-read JTAC's email "You may expect the next update in 12 Hrs." that was 2h ago.
And "My working hours are from Monday-Friday, 06:00 - 14:00 GMT" The maths don't line up
[12:25:34] waiting 1 more hour before asking to re-assign the task
[12:27:33] that means they reassign to someone else after end-of-workday right
[12:49:22] Acme-chief, Traffic, Operations: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (Vgutierrez) Open→Resolved a:Vgutierrez This has been automagically solved with 725e7f4eeb37a3742591a3f7357b6862e3b4c361, moving OCSP stapling to the a...
[12:49:24] Traffic, Operations: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez)
[12:53:33] Traffic, Operations: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (Vgutierrez)
[12:53:58] Traffic, Operations: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (Vgutierrez) p:Triage→Medium
[12:59:09] Traffic, Operations, serviceops, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) can we close this task or at least change the task title to focus on the icinga alerts? there is no issue with cert renewal itself :) ` w...
[13:08:40] Traffic, Operations, serviceops, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) Yes, it should be renamed. But i think it is traffic team's decision what to do about the monitoring per this being the " primary automated mon...
[13:09:18] netops, Operations: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) > Please find the below KB for TOE chip memory errors reported on routers with MPC-3D-16XGE-SFPP FPCs. > https://kb.juniper.net/InfoCenter/index?page=content&id=KB31235 > These messages could indicate...
[13:10:40] cdanis: they replied, we need to reboot the fpc and for that I'd prefer to depool codfw. I have a vet apt in 20min so I'll do it when I'm back
[13:12:02] ack
[14:28:57] cdanis or traffic: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/601741
[14:29:18] +1
[14:29:31] cdanis should just join traffic really
[14:29:48] indeed
[14:29:50] I'd have to learn about LVS if I did that
[14:29:55] seems scary
[14:30:19] the famous traffic<->Infra Foundations revolving doors
[14:30:45] cdanis: 2 votes against 1, basic democratic principles say that you now work for traffic \o/
[14:31:29] congrats cdanis, welcome to the team
[14:31:32] 🤔
[14:34:04] :)
[14:38:04] hey, it's not that simple, you have to give one back if you kidnap cdanis!
[14:39:00] or I have to call some special friends from some special region of my country? :-P
[15:21:38] netops, Operations: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) Open→Resolved FPC reboot solved the issue. Will re-open if it re-appears.
[16:45:25] Acme-chief, IPv6, Patch-For-Review, cloud-services-team (Kanban): tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses - https://phabricator.wikimedia.org/T245937 (Andrew) @krenair, can you summarize the results here? It looks resolved but it's not cle...
[17:06:29] hi! I have a question about Icinga. I had a service failing, like: "WARNING: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[dnsdist]" (it's now fixed)
[17:06:43] is there any way I can raise the status to "critical" for such things?
[17:07:04] the reason being that I do feel this is critical and that I should be notified on IRC (which it didn't)
[17:11:53] sukhe: check_puppetrun -w 10800 -c 21600
[17:12:05] so it should raise to critical later on
[17:12:12] the number is in seconds
[17:15:51] The email to noc@ "Retirement of GeoIP Legacy Downloadable Databases May 2022" might be of interest
[17:16:01] "We're reaching out because you downloaded at least one GeoIP legacy database last month."
[17:16:25] XioNoX: ah oh, warning level and interval, thanks
[17:17:52] * sukhe tries to figure out how to set it for the host
[17:21:10] sukhe: it's the default value fleet wide, I don't think you can override it per host
[17:22:57] ah ok. I don't know why I assumed that it would notify immediately in case puppet failed to restart the service
[17:23:58] I mean, I got the error message during the merge itself on the host, which I guess does count that things went wrong
[18:38:56] Acme-chief, IPv6, Patch-For-Review, cloud-services-team (Kanban): tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses - https://phabricator.wikimedia.org/T245937 (Krenair) >>! In T245937#6185676, @Andrew wrote: > @krenair, can you summarize the results...
[20:29:38] Traffic, Operations, Performance-Team, Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Krinkle)
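
Note on the 17:11 exchange: the thresholds quoted there (`check_puppetrun -w 10800 -c 21600`, "the number is in seconds") work out to 3 hours for warning and 6 hours for critical. The sketch below only illustrates that arithmetic and a generic warning/critical escalation by elapsed time. The function names are hypothetical, and it is an assumption that the thresholds are compared against a single elapsed-time value; this is not the actual logic of the check_puppetrun plugin, and the failure-count warning quoted at 17:06 is not modelled.

```python
# Threshold values quoted in the log; everything else here is illustrative.
WARN_SECONDS = 10800  # value passed to -w
CRIT_SECONDS = 21600  # value passed to -c


def to_hours(seconds):
    """Convert a threshold given in seconds to hours."""
    return seconds / 3600


def classify(elapsed_seconds, warn=WARN_SECONDS, crit=CRIT_SECONDS):
    """Map an elapsed time in seconds to a Nagios-style state name.

    Assumption: the state escalates once the elapsed time reaches
    each threshold (hypothetical model, not the real plugin code).
    """
    if elapsed_seconds >= crit:
        return "CRITICAL"
    if elapsed_seconds >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(f"warning after {to_hours(WARN_SECONDS):g}h, "
          f"critical after {to_hours(CRIT_SECONDS):g}h")
    for elapsed in (15 * 60, WARN_SECONDS, CRIT_SECONDS):
        print(elapsed, classify(elapsed))
```

Under this model a condition that has lasted 15 minutes stays below both time thresholds, which is consistent with the remark at 17:12 that it "should raise to critical later on".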