[07:21:04] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) Fixed archiva and removed puppet in analytics-in4. The last step is to drop Ganglia and git-deploy terms from common-infrastructu...
[07:31:20] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) a:03elukey
[08:50:55] https://github.com/cloudflare/gokeyless
[08:51:12] go keyless (key) server implementation
[08:51:22] with HSM support
[08:59:02] BTW, ema: let's deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444005/ ?
[08:59:25] vgutierrez: yup!
[09:02:11] !log Bump AES128-SHA traffic redirection to 100% - T192555
[09:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:15] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555
[09:45:48] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Volans) This is now part of this quarter's goals, moving it as a child of T199083.
[09:46:14] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Volans)
[10:13:42] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 (10MoritzMuehlenhoff)
[10:24:30] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 (10MoritzMuehlenhoff)
[12:27:25] bblack: hey! could you take a look at the patches here? https://gerrit.wikimedia.org/r/#/q/topic:T164609+(status:open) VTC tests are green
[12:27:26] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609
[12:27:40] bblack: pcc output looks sane to me https://puppet-compiler.wmflabs.org/compiler02/11728/
[12:55:14] ah, snap, because of '^([^.]+\.)?planet\.wikimedia\.org$' we cannot use literal=true in re2.set()
[13:33:48] which is pretty annoying! Either we change all hostnames in cache::alternate_domains to look like 'grafana\.wikimedia\.org', or we add backslashes programmatically in the template
[13:40:02] ema: I'd opt for adding some new metadata to the structure in hieradata/role/common/cache/misc.yaml to indicate it's a regex vs a literal (e.g. "regex_key: true") so we know the difference explicitly, get rid of literal=true, and then you can take one of two approaches on regex vs non-regex keys: either paste \Q...\E around the non-regex ones, or use re2's quotemeta function on the non-regexes.
[13:41:13] (but then I haven't looked in depth at the whole runtime vs template-time implications of such approaches)
[13:42:19] I think we need anchors anyways though, or else it will match substrings.
[13:42:37] e.g. "grafana.wikimedia.org" + literal=true might match foografana.wikimedia.org?
[13:43:08] we pass anchor="both" to re2.set
[13:43:15] ah
[13:44:36] sec, looking at the current template stuff
[13:45:11] thanks!
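(A minimal sketch of the template-side switching bblack suggests above, assuming a hypothetical "regex_key" flag on each cache::alternate_domains entry; the real logic would live in the ERB template that emits the VCL, and "re2_set" is just an illustrative object name, not the actual set name in the template.)

```python
# Sketch: given alternate_domains entries with a hypothetical regex_key flag,
# produce the patterns that would be handed to re2's .add(). Non-regex keys get
# wrapped in \Q...\E so literal=true can be dropped while dots still match literally.
alternate_domains = {
    'grafana.wikimedia.org': {'regex_key': False},
    '^([^.]+\\.)?planet\\.wikimedia\\.org$': {'regex_key': True},
}

def re2_patterns(domains):
    for key, meta in sorted(domains.items()):
        if meta.get('regex_key'):
            yield key                      # already a regex: pass through as-is
        else:
            yield '\\Q' + key + '\\E'      # quote metacharacters in the literal hostname

for pattern in re2_patterns(alternate_domains):
    print('re2_set.add("%s", ...);' % pattern)   # illustrative VCL output only
```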
[13:46:54] it's too bad they don't allow regex options in .add()
[13:47:26] anyways, the quotemeta() method should work
[13:47:59] remove literal=true, add new metadata to explicitly indicate the entries with regex keys, and do some template-level switching on that between:
[13:48:31] re2.add("", ...)
[13:49:00] re2.add(""), ...)
[13:49:02] I don't see quotemeta in vmod_re2(3) ?
[13:49:03] heh
[13:49:15] I saw it in https://code.uplex.de/uplex-varnish/libvmod-re2/
[13:49:56] ha! https://code.uplex.de/uplex-varnish/libvmod-re2/commit/e722851ad7911f26b72bb66ed445150da8b2112d
[13:50:28] yeah "a month ago" probably before what we have released->packaged
[13:50:53] the easiest alternative would be to inject \Q\E around them in the template
[14:03:51] bblack: if/when you have a couple of minutes I would like to check with you where/how you want to put an additional script for the DNS repo CI (T182028 for context)
[14:03:52] T182028: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028
[14:04:12] I see that the two existing CI jobs are just defined in the integration-config repo
[14:12:36] do we just want it in CI context? does the script rely on other data (e.g. from puppetdb?) or is it just working from zonefiles?
[14:13:03] volans: it would be nice to have it available in general (e.g. to run offline before committing, and/or to run it as a preflight check at merge time too)
[14:13:24] (in the latter case, I mean, s/merge/authdns-update/)
[14:13:41] bblack: it runs locally based on the files, so adding it to the repo seems natural to me
[14:13:46] as each one could run it locally
[14:13:54] it's pure python stdlib, no external dependencies
[14:13:59] *python3
[14:14:59] right
[14:15:20] I assume it just ignores the DYNx lines that use geoip config data
[14:15:26] but I don't want to pollute the repo's root, so maybe a script or utils or $goodname dir is better?
[14:16:02] so, this is the (other) case where there's some existing split between puppet and dns repos. There are tools for authdns-update and such in the puppet repo, which are run on the dns servers
[14:16:23] but yeah, this makes more sense in the dns repo itself
[14:16:46] utils/ ?
[14:17:04] I'm usually bad at naming ;)
[14:17:10] and then if we want to tie it into jenkins, it's already checking out the dns repo, it can just run the tool from the checkout
[14:17:21] utils/ works for me
[14:18:02] most likely our next major gdnsd release is going to change some things for better or worse for your checker
[14:18:28] we'll be able to get rid of the jinja templating stuff (e.g. {{zonename}})
[14:19:01] (replaced by some other syntax to parse, probably "@@", where the standard "@" is the current origin, and "@@" is the original file-level origin)
[14:19:31] so we can do things like "$ORIGIN codfw.@@" within the wmnet file
[14:19:40] ok I guess it will not be hard to adapt
[14:19:43] right
[14:20:01] I've kept it "simple" on purpose and made some assumptions for now based on our usage, it's not a general-purpose checker
[14:20:08] the other thing is we'll be adding back the $INCLUDE statement, which can switch origins on inclusion, and we might want to break up our subdomain files at that point
[14:20:23] like we always use fully qualified origins, etc..
[14:20:24] $INCLUDE includes/codfw.wmnet codfw.@@
[14:20:51] just things to ponder in the long run, none of it's set in stone yet
[14:21:11] ok
[14:21:19] but we'll want $INCLUDE anyways to make it easier to integrate manual+automatic data (think netbox autoimport of data)
[14:21:32] another question I had is, does gdnsd do some magic 'expansion' of IPv6 PTRs? or is this just a plain bug?
[14:21:35] https://github.com/wikimedia/operations-dns/blob/master/templates/0.8.c.e.2.0.a.2.ip6.arpa
[14:21:44] bblack: agree
[14:21:57] what do you mean about magic?
[14:22:18] (no, in general I don't think there's any magic there, nor any bug)
[14:22:39] IPv6 syntax allows for compression of the IPs, but I didn't find any doc saying you can compress the PTRs too
[14:22:44] you can't
[14:23:15] and that file is defining a PTR with 16+1 nibbles instead of 32
[14:23:54] oh, a bug in the zonefile, probably, in this case; I thought you meant a code bug
[14:24:08] no no, zonefile bug, sorry for not being specific :)
[14:25:22] that PTR record is effectively for: 2a02:ec80:0500:0001:1000:0000:0000:0000/128 in fully-expanded form
[14:25:50] it's probably meant to be: 2a02:ec80:0500:0001:0000:0000:0000:0001 ?
[14:26:46] but who knows, maybe it's as it should be, there's no matching forward to compare, would have to check actual network config
[14:27:14] for the moment I just wanted to check that it's ok to report it :)
[14:27:23] probably! :)
[14:27:47] in the phab task there are links to the two pastes with the current script output... and I need to check with ro.b the stuff about the asset tags
[14:28:23] if I had to guess, probably cases like:
[14:28:29] ERROR: Found 3 name(s) for IP '10.193.1.251', expected 2 (hostname, wmfNNNN): wmnet:3676 mw2269.mgmt.codfw.wmnet. A 10.193.1.251 wmnet:3833 wmf5823.mgmt.codfw.wmnet. A 10.193.1.251 wmnet:4286 wmf6608.mgmt.codfw.wmnet. A 10.193.1.251
[14:29:03] (a) we've sometimes (in the past) put duplicates for wmfNNNN + hostname in .mgmt., but I guess with the low rate of such errors, this is a legacy thing meant to be removed?
[14:29:20] and then (b) probably the hardware was replaced and re-used the same mgmt IP as the old, but never removed the old wmfNNNN
[14:29:35] yeah
[14:30:08] oh I think I'm wrong about (a), I think the duplication is expected and normal, seems most hosts have it.
[14:30:16] (the hostname + wmfNNNN duplication)
[14:30:25] it's just complaining because of the double wmfNNNN
[14:36:11] bblack: I went for the easy alternative :) https://puppet-compiler.wmflabs.org/compiler02/11735/cp1052.eqiad.wmnet/
[14:37:15] volans: you can double-check stuff like this with dig against one of our nameservers, too:
[14:37:18] bblack@alaxel:~/repos/dns$ dig +short @ns1.wikimedia.org 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.5.0.0.8.c.e.2.0.a.2.ip6.arpa PTR
[14:37:21] bblack@alaxel:~/repos/dns$ dig +short @ns1.wikimedia.org 1.1.0.0.0.0.0.5.0.0.8.c.e.2.0.a.2.ip6.arpa PTR
[14:37:24] * volans meeting, sorry, might delay replies
[14:37:24] ge-2-0-0-2483.mr1-esams.wikimedia.org.
[14:37:27] (no answer for the first)
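(A small stdlib-only Python sketch, in the spirit of volans's checker, of how the fully-expanded reverse name for the address guessed at above compares with the 16+1-nibble name actually present in the zonefile; nothing here is the checker's real code.)

```python
# Compute the full (uncompressed) ip6.arpa name for an IPv6 address and compare it
# with the record name found in templates/0.8.c.e.2.0.a.2.ip6.arpa.
import ipaddress

def ptr_name(addr):
    """Return the fully-expanded 32-nibble reverse name for an IPv6 address."""
    return ipaddress.ip_address(addr).reverse_pointer

expected = ptr_name('2a02:ec80:500:1::1')     # the "probably meant to be" address
print(expected)
# -> 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.5.0.0.8.c.e.2.0.a.2.ip6.arpa

# The record that dig actually answers for only has 16+1 nibbles, so it names a
# different (almost certainly unintended) /128:
found = '1.1.0.0.0.0.0.5.0.0.8.c.e.2.0.a.2.ip6.arpa'
print(len(expected.split('.')), 'labels vs', len(found.split('.')), 'labels')
```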
[14:38:56] ema: I guess wm_common_recv_pass ends up being common to both the primary+alternate VCLs?
[14:40:10] well, in some sense, but not the one I meant
[14:40:13] /etc/varnish/wikimedia-common_misc-backend.inc.vcl
[14:40:28] but it's also where the re2 set is defined
[14:40:59] wm_common_recv_pass is defined in /etc/varnish/wikimedia-common_misc-backend.inc.vcl
[14:41:12] and is different from the one for text
[14:42:48] every re2.set() I see in the cp1052 diff looks the same (misc or text, fe or be, all in common)
[14:43:16] ah, you meant wm_common_domains_init?
[14:43:20] yes
[14:43:32] yes, we should not define the set in the alternate VCL
[14:43:47] yeah, maybe
[14:44:08] although arguably, we could also replace our huge if-chain with one that has backends as data in the alternate VCL, too.
[14:44:13] but kinda optional at this point
[14:44:37] set_backend__ I mean
[14:44:52] gets complicated with subpaths though
[14:44:58] mmh yeah
[14:45:03] hi everybody, I'd need to disable puppet on all the cp hosts running vk-eventlogging to safely apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444232/ - anything against me proceeding?
[14:45:54] elukey: no objections on my side
[14:47:27] yeah no objection, although I think this won't be the end of that latency ticket in general
[14:48:10] compression is going to help, but there's something more fundamentally wrong there (e.g. kafka's code or protocol design can't have enough data in flight to make good use of a high-BDP link, so it falls behind?)
[14:48:54] it's the same kind of issue you see with slow scp transfers on high-BDP links: the authors didn't consider the high-BDP case in their design, so adding latency reduces throughput.
[14:49:20] BDP: https://en.wikipedia.org/wiki/Bandwidth-delay_product
[14:50:08] (BDP is effectively how many bytes can be in flight along a link, between the two endpoints. For high-latency but high-bandwidth links, the number can be quite high, and you have to be able to fill the pipe with unack'd data if you want to make good use of it)
[14:51:08] bblack: yeah I think it'll require more work, this is a first stab, thanks for the pointers :)
[14:53:01] 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Joe)
[14:53:31] so if the higher-layer protocol (say it's HTTP or kafka or ssh or whatever) requires some kind of round-trip acknowledgement of each "transaction" or "message", and they're typically small messages (or limited to a small maximum size per message), you run into these issues.
[14:54:07] e.g. our hypothetical protocol may be limited to 4000 bytes per message (or just has a typical message size ~4000 bytes) and require each to be roundtrip-ack'd before sending the next.
[14:54:25] if the link latency is 1s, you're now limited to 4000 bytes per second, even if the link could carry 10Gb/s
[14:55:47] (and the answer is to tweak the config parameters or protocol design to allow a lot more data to be in flight without serializing the acks of each one)
[14:59:08] yeah I remember that you tuned the vk webrequest config a while ago (better - the librdkafka batch configuration)
[15:00:18] I don't have any recollection of it, it's probably slipped out of my short history window, I don't presently have much idea how kafka's protocol or tuning works
[15:00:43] but yeah, sounds plausible there could be config affecting it!
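(Back-of-the-envelope numbers for the ack-serialization effect bblack describes above; the 4000-byte message and 1s RTT are his example, the 10Gb/s link figure is his too, and the arithmetic below is just those figures spelled out.)

```python
# If each message must be acked before the next is sent, throughput is capped by
# message_size / RTT regardless of link capacity; BDP is how much unacked data you
# would need in flight to actually fill the pipe.
message_size = 4000          # bytes per message/transaction
rtt = 1.0                    # seconds of round-trip latency
link_bandwidth = 10e9 / 8    # 10 Gb/s link, expressed in bytes per second

serialized_throughput = message_size / rtt
print('serialized throughput: %.0f bytes/s' % serialized_throughput)   # ~4000 B/s

bdp = link_bandwidth * rtt
print('BDP: %.0f bytes of unacked data needed to fill the pipe' % bdp)

print('link utilization: %.6f%%' % (100 * serialized_throughput / link_bandwidth))
```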
[15:06:19] bblack: (took me a bit but I found it :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/314336/
[15:06:43] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Papaul)
[15:07:47] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Papaul) @ayounsi osm-web2001, db2021, db2022 and db2024 are not showing in icinga so i don't know what is the update on those servers
[15:09:10] that's pretty cool: https://map.internetintel.oracle.com/
[15:13:25] elukey: 9K still seems like a reasonable "peak rps estimate" in general (I tried recalculating a reasonable max scenario today and came up with 7500, so meh)
[15:15:22] and our low point for rate in eqsin would be upload's daily lows at 2K/sec, so even if batching never sends early, it would still send every ~4.5s, which is well under the 300s timeout
[15:16:13] hmmmm
[15:18:07] yeah, that whole thing needs more digging. understanding the actual protocol and code behaviors is probably necessary.
[15:19:32] yes indeed
[15:19:34] note also nearby the "topic_request_timeout_ms" setting and its text
[15:19:38] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Papaul) a:05Papaul>03Cmjohnson @Cmjohnson hey Chris assigning you the task so you can do your audit. Once done you can assign it to @ayounsi. Thanks.
[15:19:43] maybe that needs re-raising a bit, all things considered
[15:20:11] (but we might want to raise the max buffered setting with it, to be safe)
[15:20:38] in theory webrequest metrics look very good from my point of view, eventlogging is the only instance that sometimes drops data, this is why I am concentrating on it for the moment (to avoid too many changes/variables)
[15:21:07] ah, right, I was thinking only in webrequest terms
[15:21:22] eventlogging might have such a low rate that it's batching up too much
[15:23:07] patchset updated to avoid defining the hostnames set in alternate VCLs: https://puppet-compiler.wmflabs.org/compiler02/11737/cp1052.eqiad.wmnet/
[15:26:16] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (10ayounsi) Did the loop test on cr3-ulsfo but still no stable link. Update from JTAC: > I checked in my lab, I am able to bring up this link, however I couldn’t find Fiberstore optics in...
[15:33:15] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10ayounsi)
[15:33:28] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10ayounsi) Thanks, switch port description updated.
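(The batch-fill arithmetic behind the ~4.5s estimate above, and why a much lower eventlogging rate could leave batches sitting far longer. The 9000-message batch size, 2K/sec low rate and 300s timeout are the figures quoted in channel; the lower rates are illustrative, and the exact varnishkafka/librdkafka setting names are not asserted here.)

```python
# How long does it take to fill a batch at a given message rate, compared to the
# send timeout being discussed?
batch_messages = 9000        # messages per batch before a size-triggered send
timeout_s = 300              # timeout the fill time is being compared against

for rate in (2000, 100, 10):         # msgs/sec: eqsin upload low, plus lower examples
    fill_time = batch_messages / rate
    print('rate %5d msg/s -> batch fills in %7.1f s (timeout %d s)'
          % (rate, fill_time, timeout_s))
```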
[15:35:18] bblack: I'm back, sorry for disappearing
[15:35:52] yes, we have a kinda-policy of having both wmfNNNN.mgmt and hostname.mgmt for hosts that might change hostname over time, but we kinda don't have it for other kinds of hardware
[15:36:28] so for now a missing wmfNNNN is considered a warning, while 3 or more names is considered an error, most likely due to an old record not being removed
[16:02:29] 10Traffic, 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) a:03RobH
[18:14:38] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (10Urbanecm)
[18:30:23] 10netops, 10Operations, 10Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (10ayounsi)
[18:31:26] 10netops, 10Operations, 10Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (10ayounsi)
[18:32:05] 10netops, 10Operations: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi)
[18:32:30] 10netops, 10Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (10ayounsi)
[19:14:40] I am getting a "blocked" response when trying to access wikidata from wdqs1009
[19:14:54] is there something wrong?
[19:24:01] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev)
[19:27:29] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) p:05Triage>03High
[19:29:23] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev)
[19:32:52] SMalyshev: can I try to reproduce it on my side?
[19:33:04] XioNoX: sure
[19:33:14] what's the command, etc.. ?
[19:33:18] XioNoX: see T199146
[19:33:28] T199146: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146
[19:33:35] XioNoX: basically access https://www.wikidata.org/wiki/Q18289639?action=constraintsrdf&nocache=1531160117052 from production (like with curl)
[19:34:58] what's the IP?
[19:35:14] I guess the internal v6 for wdqs1009?
[19:35:21] XioNoX: possibly
[19:35:38] sent you by private chat
[19:36:38] thx
[19:37:28] so it's the webproxy IP... not sure why it's banned or why it thinks it's an edit
[19:37:55] I can't help more here
[19:39:13] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev)
[19:41:51] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Jonas) My IP is also blocked {F23514678}
[19:45:22] XioNoX: thanks for trying anyway :)
[19:45:58] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10BBlack) This raises some questions that are probably unrelated to the problem at hand, but might affect things in...
[19:49:19] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) > Why is an internal service (wdqs) querying a public endpoint? It needs to load Wikidata data, and...
[19:51:14] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) Yes, I verified, I get the same block with `curl --noproxy \*`. Just different IP in the error message.
[19:51:30] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Gehel) >>! In T199146#4409455, @BBlack wrote: > This raises some questions that are probably unrelated to the pro...
[19:53:23] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) For the record, `204 NO CONTENT` or 200 with RDF output is the right answer. For most items, it's 204...
[19:57:01] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Gehel) It looks to me that the block is done by mediawiki itself (see P7355 for details): < x-cache: cp1066...
[20:04:15] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) Yeah looks like ipblocks table for wikidata has block on `2620:0:862:101:0:0:0:0/96` by user "Merlissi...
[20:08:28] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Gehel) >>! In T199146#4409514, @Smalyshev wrote: > Yeah looks like ipblocks table for wikidata has block on `2620...
[20:19:13] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) This block seems to be driven by `$wgSoftBlockRanges` setting in CommonSettings.php, which includes `$...
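(A quick way to confirm whether a given source address falls inside the range found in the ipblocks table above. The /96 range is the one quoted in the task; EXAMPLE_SOURCE is a hypothetical placeholder, since the actual wdqs1009/webproxy address was only shared privately in channel.)

```python
# Containment check against the blocked range reported in T199146.
import ipaddress

blocked_range = ipaddress.ip_network('2620:0:862:101::/96')
EXAMPLE_SOURCE = ipaddress.ip_address('2620:0:862:101::42')   # placeholder address

print(EXAMPLE_SOURCE in blocked_range)   # True -> MediaWiki's (soft)block would apply
```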
[20:35:59] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Mahir256) Hi @Jonas: I blocked that particular range, among others allocated to Telefonica Germany, in an attempt...
[20:38:15] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service: "Blocked" response when trying to access constraintsrdf action from production host - https://phabricator.wikimedia.org/T199146 (10Smalyshev) Looks like we don't need to change blocks - instead, 'constraintsrdf' should be marked as read action...