[07:26:23] Traffic, Operations: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (elukey) Same thing happened this morning for cp1068 from 6:45 to 6:48 UTC: {F9522769} Self recovered, caused 503s and alerts for various text domains.
[07:39:48] Traffic, Continuous-Integration-Infrastructure, DNS, Operations: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3606984 (hashar) That is related. As I migrated some jobs from Trusty to Jessie, I have added a couple Jessie instances. Tha...
[07:41:04] Traffic, Operations: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3606987 (elukey) p:Triage>High
[08:51:02] Traffic, MediaWiki-Cache, MediaWiki-JobQueue, Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3607072 (Krinkle)
[09:40:48] Traffic, Continuous-Integration-Infrastructure, DNS, Operations: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607247 (hashar) I am trying to add the GeoIP files on the CI puppet master. Gotta fix some puppet madness with an undefined...
[09:45:21] Traffic, Discovery, Maps, Maps-Sprint, Operations: Make maps active / active - https://phabricator.wikimedia.org/T162362#3607256 (Gehel) @Pnorman could you have a look at the codfw servers and see if we are ready to move on this? For reference, the puppet change to do: https://gerrit.wikimed...
[10:44:59] Traffic, Continuous-Integration-Infrastructure, DNS, Operations, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607455 (hashar) a:hashar I have rebuild the jenkins build and it passed on the slave 1003 ( https://integra...
[10:45:43] Traffic, Continuous-Integration-Infrastructure, DNS, Operations, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607458 (hashar) p:Triage>Normal
[15:35:53] Traffic, Operations: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#3608227 (BBlack) Another note-to-self for the future: https://gerrit.wikimedia.org/r/#/c/301817/ is where we removed the fairly-similar `AES128-SHA256` and `AES128-GCM-SHA256`, throwing all such cl...
[17:15:37] netops, DC-Ops, Operations, ops-eqiad: two switches have same serial in racktables - https://phabricator.wikimedia.org/T175737#3608528 (Cmjohnson) @robh both asset tags are on the switch...it's been awhile so not sure how that happened. 1 switch, 2 asset tags. Delete the wmf4199 entry?
[17:20:35] netops, DC-Ops, Operations, ops-eqiad: two switches have same serial in racktables - https://phabricator.wikimedia.org/T175737#3608540 (RobH) Open>Resolved I've deleted the wmf4199 and updated the notes entry for wmf4503. This will fix the issue, thanks for confirmation about the asset t...
[18:27:32] could i reboot acamar in a bit? i was wondering about precautions and issues last time pdns recursor was rebooted and i got this reply https://phabricator.wikimedia.org/T162850#3603007 Icinga part is obvious but "pybal logs" i am not sure where exactly. i mean, sure /var/log/pybal is on lvs servers and i know it's on eqiad, but which of lvs1001-1006
[18:42:11] other topic: this should be the fix for the DNS lint issue / missing MaxMind file https://gerrit.wikimedia.org/r/#/c/377986/
[18:43:52] which software are you using as recursor?
[18:50:36] powerdns
[18:51:32] I may have a look at your config later
[18:51:33] mutante: I think we should still take all the extra precautions for now
[18:51:40] which means:
[18:52:05] 1) Explicitly depool acamar from eqiad recdns (you can confirm it in logs and ipvsadm -Ln output on lvs1002, should be the active LVS for it)
[18:52:40] 2) Take acamar's IP out of the special-cased resolv.conf overrides
[18:53:01] err, I'm mixing up codfw+eqiad here
[18:53:35] 1) Explicitly depool acamar from codfw recdns (you can confirm it in logs and ipvsadm -Ln output on lvs2002, should be the active LVS for it)
[18:54:38] 2) Take acamar's IP out of the special-cased resolv.conf overrides (git grep for 208.80.153.12 in site.pp)
[18:55:03] (and make sure that puppet change has applied on the relevant hosts (achernar + lvs200x)
[18:55:11] ah! perfect. i was looking where 2 was:) thank you
[18:55:17] ok
[18:57:43] ns1/baham is a bit trickier to coordinate
[18:58:16] uhm, ok, will leave that for last
[18:58:17] we can make some router config adjustments and have ns0/radon answer for the ns1 public IP, freeing up baham for reboot
[18:59:06] all of baham, radon, eeden have local listening IPs for the virtual ns0/1/2 IPs. It's up to explicit static router config to direct which servers answer which public IPs, basically.
[19:00:59] aha! ok. so i think i will first go for lunch, then do acamar today following your instructions and then on another day get back to baham and ask for the router changes
[19:07:20] ok sounds great :)
[19:31:18] Traffic, Discovery, Maps, Maps-Sprint, Operations: Make maps active / active - https://phabricator.wikimedia.org/T162362#3608957 (Pnorman) I used a SSH tunnel to check maps2001.codfw.wmnet and it's serving tiles fine. One problem I noticed is that it is at least two months out of date on what...
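
Editor's note on the 18:53:35 step (confirming the depool in `ipvsadm -Ln` output on the active LVS): a minimal sketch of how that check might be scripted. Everything here is an assumption for illustration, not tooling from the channel: the function names `pooled_weights` and `is_depooled` are hypothetical, the sample text only imitates the usual `ipvsadm -Ln` layout, and the addresses are documentation-range placeholders rather than real hosts. It also assumes that a depooled server either disappears from the table or stays listed with weight 0; which of the two actually happens depends on the pybal configuration.

```python
import re

# Sample text shaped like `ipvsadm -Ln` output. The addresses are
# documentation-range placeholders, not taken from any real LVS host.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  192.0.2.53:53 wrr
  -> 198.51.100.10:53             Route   10     0          12
  -> 198.51.100.11:53             Route   0      0          0
"""

# A real-server row: "-> addr:port  Forward  Weight  ActiveConn  InActConn".
# The column-header row does not match because its "port" is not numeric.
REAL_SERVER = re.compile(
    r"^\s*->\s+(\S+):(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def pooled_weights(ipvsadm_output, ip):
    """Return the weight of every real-server row whose address is `ip`."""
    weights = []
    for line in ipvsadm_output.splitlines():
        match = REAL_SERVER.match(line)
        if match and match.group(1) == ip:
            weights.append(int(match.group(4)))
    return weights


def is_depooled(ipvsadm_output, ip):
    """True when `ip` is absent from the table or only listed with weight 0."""
    return all(weight == 0 for weight in pooled_weights(ipvsadm_output, ip))
```

In practice the text would come from running `ipvsadm -Ln` on the LVS host named in the log and passing its output, together with the recursor's IP, to `is_depooled`; the pybal logs mentioned in the same message remain the other half of the confirmation.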