[09:35:11] elukey: I'm ready whenever you are :)
[09:36:18] vgutierrez: hola! I am, I need to create the CR to move the service to lvs_setup, one sec
[09:36:34] ack
[09:39:14] I am following https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers
[09:39:49] so I'd need to ask a Traffic team member for the backup LVSes for low traffic :D
[09:41:35] I am going to try to see if I can get a list from puppet
[09:41:44] and you can tell me how wrong I am
[09:43:40] cool
[09:45:22] so I see lvs::configuration, and profile::pybal::primary flags, so the backups for low-traffic should be lvs1016 and lvs2020
[09:45:44] but better to check on the nodes themselves probably
[09:47:02] so lvs1015 has bgp-med 0 in pybal.conf, and wins :D
[09:47:10] anything else that I should check to be sure?
[09:47:35] bgp-med is the way to go, yes
[09:47:41] to be absolutely sure :D
[09:48:07] and that's managed by profile::pybal::primary, 10/10
[09:49:59] ok perfect, so I just merge my change, run puppet on O:lvs::balancer, and once done restart pybal on lvs2020 and lvs1016, and check ipvsadm
[09:52:05] vgutierrez: --^
[09:52:16] yep, that makes sense
[09:52:22] my change being https://gerrit.wikimedia.org/r/c/operations/puppet/+/661687
[09:52:47] ok going to bre..ahem configure pybal :D
[09:53:08] elukey: it's plug&pray technology
[09:53:12] don't sweat it
[09:53:50] ahahahahah
[10:00:13] Ask on #wikimedia-traffic connect which are the backup LVS servers for the LVS class of your service on both datacenters and restart pybal just on those. If you want to be a friendly neighbor of the Traffic team, before contacting them check in puppet the class lvs::configuration to know what lvs hosts are in high/low traffic classes, and also the profile::pybal::primary hiera flag to spot the
[10:00:19] active/standby ones.
[10:00:22] does it make sense?
[10:02:10] yup
[10:02:41] although, that bit that says #wikimedia-traffic connect?
[10:02:49] I guess it's the wikimedia-traffic channel?
[10:03:23] ah no, that bit was already there, it is a quick link to webchat
[10:05:32] oh ok
[10:09:05] hi vgutierrez, is the traffic team managing the DNS resolvers as well? We sometimes have DNS errors in mediawiki: PHP Warning: socket_sendto(): Host lookup failed [-10002]: Host name lookup failure
[10:10:49] vgutierrez: lvs2020 and lvs1016 look ok, can I move to the primaries?
[10:11:19] yep, go ahead
[10:17:12] vgutierrez: all good! Thanks for the assistance :)
[10:17:18] elukey: np :)
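
For reference, the verification loop described above comes down to a handful of commands. This is a rough sketch, not verified against production: the pybal.conf path, the cumin 'O:lvs::balancer' selector, the run-puppet-agent wrapper and the use of sudo are assumptions taken from the conversation.

    # on the candidate primaries: the host advertising the lowest bgp-med wins
    grep bgp-med /etc/pybal/pybal.conf

    # roll the merged hiera change out to every load balancer
    sudo cumin 'O:lvs::balancer' 'run-puppet-agent'

    # on the low-traffic backups first (lvs1016, lvs2020), then the primaries
    sudo systemctl restart pybal
    sudo ipvsadm -L -n    # the new service VIP should appear with its realservers
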
[10:21:04] hashar: yep.. it's out of my usual scope but we do handle the DNS resolvers. Could you provide more context? timestamps, number of occurrences, DC, instances suffering the issue?
[10:23:20] vgutierrez: we have no idea :] We seem to get some host lookup failures from time to time
[10:23:25] https://phabricator.wikimedia.org/T231025 ;]
[10:24:37] I should forge a logstash link linking to those I guess
[10:30:32] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (Vgutierrez)
[10:36:21] hashar: BTW, do you know what the retry policy is on the MediaWiki side for DNS queries?
[10:42:47] vgutierrez: absolutely no idea :\
[10:43:29] vgutierrez: I guess if it is on your phabricator board that is already an advancement :] Maybe ask on the task?
[10:43:36] it is not that urgent, it has been happening for years
[10:43:55] maybe some config has to be added in PHP
[10:44:54] ok, thanks
[10:45:05] from Triage to DNS Infra on the Traffic board. <-- looks good enough to me for now :]
[10:45:07] thx!
[10:49:19] and yeah setting a retry might do :]
[10:50:10] kids time etc :D
[10:52:15] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (hashar)
[10:53:37] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (hashar)
[10:53:54] enhanced the task a bit, I am rushing to school
[10:59:14] netops, SRE-tools, homer, netbox: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) p: Triage→Medium
[11:20:34] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (Vgutierrez) What's the current DNS query retry policy o...
[13:18:34] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (hashar) On MediaWiki side it uses [[ https://www.php.ne...
[14:21:06] Traffic, MediaWiki-Debug-Logger, SRE, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (Joe) >>! In T231025#6745199, @holger.knust wrote: > Thi...
[14:39:29] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Joe) I think it might be useful to enforce our user-agent policy at least for this image, and see who comes around complaining, given we don't seem to find a...
[16:05:43] Traffic, MediaWiki-Debug-Logger, SRE, Patch-For-Review, and 2 others: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (hashar) >>! In T231025#6803751, @Joe wrote: > ... > In the specific case, we're trying to resolve `m...
[16:15:50] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Gilles) You could even serve another image in its place to this UA, with some text and an email address to contact. You'd probably find out pretty quickly wh...
[18:41:09] cdanis: very happy to see progress on T263496, thank you! it's going to be very useful
[18:41:09] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496
[18:41:15] me too!!!
[18:41:24] the Python code I wrote in the meanwhile, sukhe ...
[18:41:41] yeah :)
[18:43:12] assuming the rollout goes well, and I haven't introduced any horrible memory leaks or heap corruptions into our Varnishes 😬, then it should just be a simple matter of modifying the schema
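
As an aside, the augmentation tracked in T263496 boils down to two MaxMind database lookups per NEL report, one for the country code and one for the AS number. Here is an illustrative sketch with the stock mmdblookup CLI; the database paths and the example IP are placeholders, and this is not the Python code mentioned above.

    # country code for a client IP (RFC 5737 documentation address used as the example)
    mmdblookup --file /usr/share/GeoIP/GeoLite2-Country.mmdb --ip 203.0.113.7 country iso_code

    # autonomous system number for the same client IP
    mmdblookup --file /usr/share/GeoIP/GeoLite2-ASN.mmdb --ip 203.0.113.7 autonomous_system_number
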
[19:03:55] I've seen no evidence of trouble on the two test hosts, so I'm going to re-enable puppet and let it roll out gradually.
[19:04:11] no out-of-the-ordinary increase in memory usage; request rates, error rates and latencies look fine, etc.
[19:31:37] bblack: around?
[19:31:59] I'm here with ryankemper to talk about T266470, unless you want to jump in a meet
[19:31:59] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[19:32:52] taking notes in https://etherpad.wikimedia.org/p/expose-wdqs1009
[19:33:08] gehel: hi :)
[19:33:15] o/
[19:33:16] ryankemper: hi too :)
[19:33:23] \o
[19:33:41] so, I guess my best understanding of the current setup is what I see in:
[19:33:47] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#157
[19:33:54] context: we want to expose wdqs1009 as a test server so that our users can make sure we're not breaking anything with the new WDQS updater
[19:34:13] there are several related definitions there for https://query.wikidata.org/
[19:34:24] for different subspaces of the URI space routing to different backends
[19:34:41] /bigdata/ldf goes explicitly to wdqs1005
[19:35:07] /bigdata and /sparql both go through the discovery service to a set of wdqsNNNN hosts
[19:35:25] but the root URL hits our misc webapps setup, for the JS powering the query UI
[19:35:29] yep, somewhat similar, but in this case we want to have a different FQDN (probably query-preview.wikidata.org or something similar) and route to a different server that isn't part of the current WDQS pools
[19:36:45] so in this case, I would guess, you'd want no special exception for /bigdata/ldf, and you'd want all of /bigdata and /sparql -> wdqs1009
[19:37:07] which is something we can just define in that file I linked, plus adding a name at the DNS layer, and done
[19:37:17] but the root URL for the UI part might be a little trickier
[19:37:39] I need to see where that's defined in git, and it might need a new site definition, and to template in a new place to send the UI queries...
[19:37:57] but there are no other special requirements I think/expect?
[19:37:58] we'll probably want to create another microsite for the JS part, that's fairly straightforward
[19:38:16] no other special requirements IIRC
[19:38:18] yep, that seems fairly simple to me
[19:38:23] so, to sum up:
[19:39:09] create a DNS entry (CNAME to dyna.wm.o), another set of entries in the backend.yaml map, and create another microsite (with the appropriate configuration)
[19:39:22] do we need to do anything for SSL certs? Or is that fully automated?
[19:39:49] that part's all shared between everything that runs through the cache, so long as you stick to our canonical domains and just change the leading hostname part
[19:40:05] to e.g. query-preview.wikidata.org
[19:40:06] "canonical" includes wikidata.org?
[19:40:11] ok, good!
[19:40:37] yeah, for reference, the canonical set is: https://wikitech.wikimedia.org/wiki/HTTPS#For_the_Foundation's_canonical_domain_names
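
On the TLS question: since the certificates served at the edge already cover the canonical domains, a new leading hostname under wikidata.org should be covered automatically. Once the name resolves, one quick sanity check is to inspect the served certificate's SAN list; the hostname below is still tentative per the discussion above.

    # confirm the certificate served at the edge covers the new hostname
    echo | openssl s_client -connect query-preview.wikidata.org:443 \
        -servername query-preview.wikidata.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
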
[19:41:14] ok, one last stupid and off-topic question: what does "dyna" stand for in dyna.wm.o?
[19:42:15] it's not a "real" DNS RR-type you would see in the wild. It's a custom thing our authserver implements which provides dynamic A and AAAA records depending on geodns. That's how we route users to the closest of our geographic edges
[19:42:37] https://github.com/gdnsd/gdnsd/wiki/GdnsdPluginGeoip
[19:42:41] ^ that, basically
[19:42:58] thanks!
[19:43:24] templates/wikimedia.org:dyna 600 IN DYNA geoip!text-addrs
[19:43:37] ^ is what all the other production CNAMEs are pointing towards
[19:43:55] gehel: also, I don't know if you've created new service hostnames since the move to netbox
[19:44:03] but we don't do commits on the ops/dns repo for that kind of thing anymore
[19:44:25] Oh no, I haven't
[19:44:33] ryankemper is the one who's going to do it
[19:45:01] I assume it's documented
[19:45:02] oh wait, I'm getting ahead of ourselves
[19:45:04] * gehel is searching
[19:45:13] that part is still done in ops/dns! :)
[19:45:25] now I'm curious anyway!
[19:45:44] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wikidata.org#35
[19:45:49] ^ is where the new hostname goes for DNS
[19:46:53] [ https://wikitech.wikimedia.org/wiki/DNS/Netbox if you're curious about how Netbox is used to automate a lot of other DNS records ]
[19:47:26] cool! no more looking for a free IP!
[19:48:16] ryankemper: is that enough info to go on to propose patches? are there parts we need to help more with?
[19:48:30] if we want to prepare everything and turn it on when ready, we can already prepare the microsite and the mapping, and only publish the DNS entry when everything is working
[19:48:38] bblack: no, I think that's a great starting point, thanks for the context!
[19:48:57] anything special about merging the mapping?
[19:48:58] wasn't aware of the canonical hostname stuff, so glad to know that it's already automated
[19:48:58] yeah, probably push the microsite update first, then the backend.yaml change in puppet, then the DNS entry last
[19:49:14] we'll ping you when we're ready!
[19:49:21] ok sounds good
[19:49:23] thanks
[19:49:29] thanks for the help!
[19:49:55] np!
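
Putting the DNS piece together: per the discussion, the new name ends up as a plain CNAME to dyna.wikimedia.org in the hand-maintained zone template, next to the existing production hostnames, and it gets published only after the microsite and the backend.yaml mapping are in place. A sketch of what such a record could look like; the query-preview hostname is still tentative and the TTL simply mirrors the dyna example quoted above.

    ; templates/wikidata.org -- hypothetical entry for the preview endpoint
    query-preview    600    IN    CNAME    dyna.wikimedia.org.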