[04:15:05] 10Traffic, 10Operations, 10Performance, 10Services (later): Look into solutions for replaying traffic to testing environment(s) - https://phabricator.wikimedia.org/T129682#2112642 (10mobrovac) Since we are sampling live traffic, should we decline this task? @Eevans thoughts? [06:39:13] 10Traffic, 10Operations, 10Pybal: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#3488505 (10ema) I've sent a patch upstream covering the virtual service removal case: http://archive.linuxvirtualserver.org/html/lvs-devel/2017-07/msg00016.html [09:01:35] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3488686 (10fgiunchedi) I'll add some thoughts/braindump below: 1. For... [09:10:30] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3488728 (10fgiunchedi) Swift's basic grouping for files are "containers" (or "buckets... [09:19:47] all LVSs except for eqiad upgraded to pybal 1.13.11 [09:20:22] so now in codfw we're doing one-packet-scheduling for recdns (lvs2002/lvs2005) [09:20:52] \o/ [09:20:55] \o/ [09:21:52] UDP traffic patterns to achernar/acamar look normal so far [10:27:02] 10Traffic, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Operations, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3488793 (10thiemowmde) [10:44:48] I wrote some notes in https://phabricator.wikimedia.org/T167304 about Kafka ACLs based on TLS Auth, let me know if you have any suggestion/comment/etc.. :) [10:45:58] The part of the TLS cert creation/management is still WIP in another task, this one is only for ACLs applied once a client is authenticated [10:47:07] but the general idea that we have for the moment is to map one TLS certificate for each logical Kafka producer consumer, and deploy it to all the hosts that need it (with appropriate perms) [10:48:53] for example: we could create one TLS cert for the logical producer Varnishkafka-upload, deploy it to all the caching upload hosts and instruct Kafka to accept produce requests from this client only for the webrequest-upload topic [10:49:17] (and the cert on the hosts will be readable only by varnishkafka) [12:48:32] ema: it might be worth trying a depool too, we should see all the traffic jump to one side immediately [13:04:43] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3489044 (10Anomie) [13:13:45] bblack: I've amended https://gerrit.wikimedia.org/r/#/c/368814/ adding some test cases. "/?title=foo&bar=baz" fails, it returns 200 [13:13:54] "/?title=foo" is right though [13:22:25] ok let's try to depool achernar and see if traffic properly jumps to acamar! [13:29:03] looks like it worked: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=17&fullscreen&orgId=1&from=now-1h&to=now&var-server=achernar&var-datasource=codfw%20prometheus%2Fops [13:30:56] nice! [13:31:27] +1! 
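(An aside on the failing test case ema mentions at 13:13: the rule's match logic can be sanity-checked outside varnishtest. bblack does exactly that further down with perl and grep -P; the same check with Python's re module, using the regexes copied from those one-liners, is only an illustration of the intended matching logic, not the VCL itself:)
>>> import re
>>> path = "/?title=foo&bar=baz"
>>> bool(re.search(r"^/(w/index\.php)?\?", path) and re.search(r"[?&]title=", path))
True
(So the URL matches both patterns and should be redirected to the mobile site; as it turns out below, the 200 ema saw came from running the tests without the patch applied, not from a regex bug.)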
[13:32:15] the LVSs are still using achernar for some reason though [13:32:24] it's in resolv.conf, but it is the second entry [13:33:20] ah, it's pybal :) [13:33:36] right [13:33:51] I think pybal uses its own dns library, not libc's, but parses resolv.conf [13:34:01] nope, it's the dns checks! [13:34:08] oh [13:34:30] so does pybal use some python dns lib, outside of the dns checks? [13:34:59] (I always assume there are edge case softwares somewhere that do, but pybal would be a nice one to be sure about) [13:35:57] socket.gethostbyname IIRC, so libc [13:38:42] https://phabricator.wikimedia.org/T154759#3403250 [13:42:37] so, not libc really, right? [13:43:00] it's a libc-lookalike that tries to parse libc's resolv.conf and act accordingly, but it's its own implementation in python [13:43:25] oh wait, maybe not [13:43:36] I *think* it actually uses libc [13:43:54] right, ok, so it does use libc [13:44:09] but it invokes the libc gethostbyname() multiple times with its own timeout/retry logic wrapping [13:46:20] $ wc -l Modules/socketmodule.c [13:46:20] 5581 Modules/socketmodule.c [13:46:21] heh [13:46:58] freaking github having problems again, I think [13:47:02] oh wait, just slow [13:51:41] bblack: ok to repool achernar? [13:51:57] yeah [13:58:29] looks good https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=17&fullscreen&orgId=1&from=1501593333583&to=1501595883162&var-server=achernar&var-datasource=codfw%20prometheus%2Fops [14:05:05] ema: so on the mobile redirect patch, your tests all pass except the one that says: txreq -url "/?title=foo&bar=baz" ? [14:05:15] I was trying to repro the regex bug on the CLI using perl, but can't yet [14:06:10] nope, "/?title=foo&bar=baz" is the first one to fail :) [14:06:18] test execution stops there [14:07:30] hmm [14:08:40] it seems simple enough, puzzling [14:08:41] $ echo '/?title=foo&bar=baz' | perl -e 'while(<>) { m,^/(w/index\.php)?\?, && m,[?&]title=, && print }' [14:08:45] /?title=foo&bar=baz [14:09:17] (I don't think any of this is complicated enough to differentiate perl from PCRE) [14:10:15] but it does work with PCRE as well [14:10:18] $ echo '/?title=foo&bar=baz' | grep -P '^/(w/index\.php)?\?' | grep -P '[?&]title=' [14:12:08] bblack: ouch, brown paper bag for ema [14:12:29] tests are passing, probably I was running them without your patch [14:12:34] sorry about that! [14:12:47] oh ok [14:12:55] thanks! :) [14:21:18] bblack: ok to carry on with pybal upgrades in eqiad? [14:27:22] yeah [14:27:27] ema: ^ [14:27:45] alright! [14:29:25] interesting side-note while digging into more dns/glibc stuff. glibc has an undocumented 'gethostbyname3_r()' + 'gethostbyname4_r()' (it documents only up to 'gethostbyname2_r()'). [14:30:00] gethostbyname3_r() returns TTL info to the application, which would be a godsend if it was widely standardized and utilized. but instead it's undocumented and there's still no standard interface for it :P [14:30:26] (so the app could actually know: "I can keep this record in app-local memory cache for 326 more seconds before re-querying" or whatever) [14:31:47] oh that would be nice indeed [14:32:18] cachestats.py for example could use that instead of an arbitrary ttl taken from the CLI... [14:33:22] it's even worse than that, as the current glibc API docs say: [14:33:25] "The gethostbyname*(), gethostbyaddr*(), herror(), and hstrerror() functions are obsolete. Applications should use getaddrinfo(3), getnameinfo(3), and gai_strerror(3) instead." 
[14:33:51] and of course getaddrinfo() and pals don't expose TTL either. So the undocumented gethostbyname3_r() seems unlikely to ever be documented and supported :P [14:35:42] there's a long discussion of related things from back in 2004 at: https://www.ietf.org/mail-archive/web/dnsop/current/msg03174.html [14:36:03] where adding TTL info to getaddrinfo() was rejected because it was already standardized and spreading without it and they didn't want to disrupt the adoption [14:41:50] so _joe_ was mentioning dnspython some days ago, we could use that I guess [14:41:57] >>> import dns.resolver [14:41:57] >>> answer = dns.resolver.query('statsd.eqiad.wmnet') [14:41:57] >>> print answer.rrset.ttl [14:41:57] 3380 [14:47:13] does it do its own thing though? I'd think so, since there's no standard TTL interface [14:48:12] the upside of "do your own thing" (roll your own resolver library/interface in $language) is you can get past the limitations of libc (crappy timeout/failover behavior which is unpredictable across versions/platforms, no TTL info, etc) [14:49:15] the downside is that there's no standardization for parsing resolv.conf (e.g. the more-advanced options that may be recent or in the future), and you have N different resolver library implementations/behaviors to account for in designing reliable resolver/cache infrastructure outside of the application, ... [14:49:35] and then if you go "fix" the libc path with a better NSS module, the applications that bypassed it don't get the benefits [14:52:07] e.g. we might design a better NSS (with flexibility to fail fast and/or in parallel through various stages of server choices), and we might design our infra around it (how we deploy and do maintenance and monitoring on: host-local caches, DC-local caches (possibly w/ anycast), on the assumption that the combination is highly resilient) [14:52:41] but then some critical software is using dnspython and parsing resolv.conf instead, and so our intended issue-free maintenance procedure causes it to fail/timeout lookups [14:53:40] or even better: the software supports both libc and DIY style resolution and it's never clear which method is used when :P [14:54:03] the client side of DNS is thorny [14:54:45] it's pervasive in everything, it can affect everything (breakage or slowness), yet the APIs suck and/or are non-standard, and most software developers (sadly, even developers of DNS client libraries) treat it like DNS is just part of the substrate of the network and should just be expected to always work. [14:55:20] because it does 99% of the time when they're trying out their code. they didn't think about how time-critically they're relying on something which can crash or need a maintenance restart. [14:56:29] well maybe they did think of it, but only to the level of "well we'll have 2+ IPs from resolv.conf, and if we don't get an answer in a second we'll move on down the list" or something [14:57:09] then the other app-layer developer is like "I can totally invoke name resolution in this tight loop and it'll be fine, it worked in testing". Or worse "I can do this once at startup and never check again" [15:01:52] (it's hard to really blame them though. what they're given to work with is broken-by-design) [15:07:35] ǥo _joe_ [15:07:42] nooope [15:08:21] years ago in the latter category there was the jvm too, dns names were never refreshed again [15:08:25] good times [15:08:54] yeah! [15:09:21] have you tried to turn it off and on again?
:-P [15:09:51] yes, we tried turning hydrogen off and on again, and it broke things worse :P [15:10:14] lol [15:10:21] :) [15:11:02] pybal 1.13.11 running on all LVSs [15:11:59] requests to chromium vs hydrogen are still "unbalanced" though [15:12:40] perhaps we need to clear the IPVS rules between stop and start? [15:14:51] well not so much the rules, it's possible there's some effect from the tables of previous connections or whatever? [15:15:00] but those should eventually expire [15:15:34] right, we do remove/recreate services, so it might be previous table values? Let's wait a bit and see if they expire [15:16:06] weight is 10 for both hosts, I've just double-checked [15:16:15] -> 208.80.154.50:53 Route 10 0 655215 [15:16:18] -> 208.80.154.157:53 Route 10 0 1040824 [15:16:30] ^ I think that's the source of imbalance, and the numbers aren't moving that I've seen so far [15:16:37] but maybe they'll expire in a bit... [15:16:53] oh they are dropping now, slowly [15:18:31] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3489494 (10RobH) [15:19:58] at the rate they're decreasing currently, it looks like ~4h to zero [15:20:05] that seems a little crazy [15:20:29] about every 15s they make a small jump down, by ~0.1% [15:20:56] maybe more like 20s, I donno [15:21:57] ema: what if you depool (remove from ipvs table) one server and put it back, then the other? maybe it will clear the existing rules? [15:22:35] I guess it wouldn't right? because surely the -D on pybal start did more than that [15:22:54] you'd think so [15:22:55] but maybe it intentionally preserves the state when you -D a whole service, but might not on removal of an individual backend? [15:23:11] who knows [15:24:53] I'm going to zero the stats counters so I can see better. I donno if that freaks out some graph. [15:25:10] ok [15:26:05] just for that one service (ipv4 udp dns) I meant [15:26:09] root@lvs1002:~# ipvsadm -Lnu 208.80.154.254:53 --stats [15:26:09] Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes -> RemoteAddress:Port [15:26:12] UDP 208.80.154.254:53 451048 2933352 0 192949K 0 -> 208.80.154.50:53 225524 933476 0 61705938 0 -> 208.80.154.157:53 225525 1999878 0 131244K 0 [15:26:16] ugh at bad paste lines [15:26:24] but anyways, --stats or --rate are telling [15:26:41] "conns" and "CPS" look evenly balanced (OPS scheduler of unknowns) [15:27:01] but InPkts or InPPS still look imbalanced even after wiping the counters' history [15:27:32] so it has to be the memory preserved in those InActConn entries that are extremely slow to expire [15:27:45] 10Traffic, 10Operations, 10Mobile, 10Need-volunteer, and 5 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#3489526 (10Jdlrobson) 05Open>03Resolved a:03Jdlrobson Tested on a mobile device.... 
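(For reference, the "~4h to zero" eyeball estimate above follows directly from the observed decay rate; a back-of-the-envelope check in Python, assuming the InActConn drain stays roughly linear:)
step_seconds = 15.0          # bblack's observation above: one drop every ~15-20s
fraction_per_step = 0.001    # each drop is roughly 0.1% of the entries
hours_to_zero = step_seconds / fraction_per_step / 3600   # ~4.2h at 15s per step
(With a ~20s step the same arithmetic gives about 5.6h, in line with the "~6h" reading from a different set of data just below.)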
[15:30:51] I took a different set of data to look at the time-to-zero and looks more like ~6h heh [15:31:25] I'm gonna try deleting just one temporarily, and see if inactconn clears [15:33:31] nope [15:33:38] inactconn stays there [15:35:09] weird [15:35:25] I did it manually just to be sure I knew what was happening [15:35:39] ipvsadm -d -u 208.80.154.254:53 -r 208.80.154.157:53 -> ipvsadm -a -u 208.80.154.254:53 -r 208.80.154.157:53 -w 10 [15:35:55] but they came back with it :P [15:36:31] what we could do is reboot the relevant LVSes, at least in eqiad [15:36:57] maybe lvs1005 first, then 1002 [15:37:09] although, well, lvs1005 has no state saved anyways [15:37:13] right [15:37:38] so just 1002? [15:38:05] 10netops, 10Operations, 10monitoring: "MySQL server has gone away" from librenms logs - https://phabricator.wikimedia.org/T171714#3489551 (10ayounsi) I don't see this error in the log files anymore, maybe that was temporary during the service move? [15:38:17] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3489552 (10RobH) ``` network ports update: robh@asw-ulsfo# show | compare [edit interfaces xe-1/0/9] - description cp4019.ulsfo.wmnet; + description cp4025; - disable; + enable; [e... [15:38:40] bblack: +1 [15:38:56] 1005 looks like this: [15:38:59] ok, I'll step through it (and disable pybal on 1002 a bit ahead of the reboot, to give a little more grace to the failover) [15:39:00] UDP 208.80.154.254:53 4780650 4780650 0 313776K 0 [15:39:03] -> 208.80.154.50:53 2390325 2390325 0 156890K 0 [15:39:06] -> 208.80.154.157:53 2390325 2390325 0 156886K 0 [15:39:22] yeah, from the brief failover of traffic to it, after you upgraded 1005, when you restarted 1002 [15:39:27] so it does work right when fresh [15:39:33] yes! [15:41:55] it seems like ops mode kills the inactconn entries entirely, which is a nice bonus [15:42:03] instead of junking up a table somewhere pointlessly [15:48:13] ok now they're balanced properly through 1005 [15:48:46] it's coming back to 1002 now [15:49:37] cool, I've seen InActConn flipping to 1 for a sec and then back to 0 :) [15:50:14] oh yeah [15:50:34] I guess ops must be able to set the time, and it sets it to some minimal value (but can't completely avoid it) [16:27:05] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3489707 (10RobH) [16:39:09] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3489738 (10RobH) [16:40:31] 10Traffic, 10Operations: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3489494 (10RobH) a:05RobH>03BBlack These are now ready for use, and are calling into puppet with role spare. [16:46:32] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3489756 (10RobH) [16:46:38] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (10RobH) 05stalled>03Open [16:46:41] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3489758 (10RobH) [16:51:37] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3489784 (10RobH) a:05RobH>03BBlack cp4022 ready for use. 
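(Circling back to the dnspython snippet at 14:41 and the cachestats.py remark at 14:32: a minimal sketch of what honouring the answer's TTL, instead of an arbitrary TTL taken from the CLI, could look like. The helper below is hypothetical illustration code, not actual cachestats.py or pybal code, and assumes dnspython is available:)
import time
import dns.resolver

_cache = {}  # name -> (address, expiry as a unix timestamp)

def lookup(name):
    # Return the cached address while the previous answer's TTL is still valid,
    # otherwise re-query and remember the new address for exactly rrset.ttl seconds.
    addr, expires = _cache.get(name, (None, 0))
    if time.time() < expires:
        return addr
    answer = dns.resolver.query(name)       # same call as in the 14:41 snippet
    addr = answer.rrset[0].address          # first A record in the answer
    _cache[name] = (addr, time.time() + answer.rrset.ttl)
    return addr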
[16:52:41] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3489786 (10RobH) cp402[1-8] are all racked and ready for use. [17:05:23] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3489850 (10BBlack) Excellent news! I'll try to squeeze in replacing one of the clusters ASAP, which will decom another 6x of the old cp to let us move further. [17:10:01] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3489868 (10RobH) [18:38:09] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3490229 (10Fjalapeno) @fgiunchedi We can keep a separate database / list of all the Z... [19:16:27] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3490510 (10Mholloway) Hey @fgiunchedi, > > 1. For production swift ac... [19:24:53] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3490549 (10Mholloway) Chatted with @Fjalapeno about this. Sounds like we'll end up k... [20:57:20] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481495 (10faidon) (T119654 is a restricted task, I have no access to it) While I don't think bonded ports are particularly problematic, I think we should be...