[07:18:05] Traffic, Operations, ops-codfw: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (elukey) p:Triage>High
[09:56:32] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (Krenair) >>! In T174596#4740124, @aborrero wrote: > BTW, I'm focusing on the `eqiad1` deployment...
[10:14:40] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (aborrero) >>! In T174596#4741643, @Krenair wrote: > I guess there's still the question of whethe...
[10:14:54] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (aborrero) Open>Resolved
[10:48:16] https://www.zdnet.com/article/firefox-and-edge-add-support-for-googles-webp-image-format/
[11:16:06] Traffic, Operations, Pybal: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 (ArielGlenn)
[12:38:07] webp support for firefox has been implemented for firefox 65, which should be released at the end of january
[12:38:37] between that and edge shipping it last month, the timeline to get to a majority of our traffic supporting webp is a lot faster than I thought
[12:41:44] https://caniuse.com/#feat=webp this has been updated
[12:53:31] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (ayounsi)
[13:12:13] gilles: nice!
[13:14:58] sigh
[13:15:00] > The logging.yaml file defines all custom log file formats, filters, and processing options. The file itself is a Lua script.
[13:15:18] this is how trafficserver's documentation for logging.yaml begins /o\
[13:16:04] I've proposed https://github.com/apache/trafficserver/pull/4601/files after a decent amount of cursing
[13:49:01] bblack: let me know if there's a window to lower the webp threshold this week
[14:00:07] Traffic, Operations, Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (BBlack) Seems to be testing fine on https://pinkunicorn.wikimedia.org/ , and the pre-deployment to all caches hosts and OCSP Stapling looks fine too. The skew window for the transitio...
[14:02:03] nice bblack :D
[14:57:54] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Cmjohnson) @ottomata Lets go with cloudvirtan1xxx.
[15:34:29] Traffic, Operations, ops-codfw: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (Vgutierrez) The system is online since 07:30 UTC
[16:17:50] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Ok!
[16:20:11] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) @Cmjohnson I updated {T207194} to reflect the new naming. Please proceed and then assign to Cloud VPS folks f...
[16:28:06] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (Vgutierrez) we will be replacing lvs2006 with lvs2010
[16:28:49] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can...
[17:09:34] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[17:09:46] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can...
[17:09:50] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[17:10:21] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can...
[17:33:26] Traffic, Analytics, Analytics-Kanban, Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (elukey) @mforns created https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/prometheus-varnishkafka-exporter
[18:33:04] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) @Papaul we need to re-wire lvs2009 & lvs2010 to connect the first interface (enp175s0f0) to the main row for each server.
[18:33:58] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[19:13:58] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) @Vgutierrez the first NIC of each server is connected to the switch where the server is racked in. Example: lvs2010 is racked in D2 so the first NIC is connect...
[19:20:13] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) @Papaul so at least in lvs2010, debian installer seems to think that enp175s0f0 is the first NIC, the mac addr is 00:0a:f7:f0:0c:10. in lvs2009 the mac addres...
[19:46:24] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) @Vgutierrez on lvs2010 can you tell me which interface has this MAC address Routing instance : default-switch Vlan MAC MAC...
[19:50:23] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) @Papaul enp59s0f0
[19:57:16] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) @Vgutierrez which position 2nd, 3rd or 4th? since enp175s0f0 is 1st
[22:02:47] netops, Operations, cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (Andrew)
[22:30:56] netops, Operations, cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (ayounsi) Pushed to cr1/2-eqiad `lang=diff [edit firewall family inet filter cloud-in4] [...] + term labnet-nova-api { + from { +...
[22:33:59] netops, Operations, cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (ayounsi) Open>Resolved
[22:45:09] mutante: I've captured one of the faulty requests from varnish's pov, triggered by a curl fetch. So, it's not browser-specific, and it's not any kind of cached content.
[22:45:43] mutante: in the case I captured, the varnish backend had just opened a fresh connection to vega, and sent "Host: bienvenida.wikimedia.org", but got back bugzilla-static content...
[22:46:59] bblack: hmm, ok! unfortunately i can't use apache-fast-test from the cumin host because of ferm rules only allowing it from cache misc
[22:47:17] so it's a vega problem
[22:47:35] yeah I'm still digging
[22:47:48] also curious, and probably pointing at some other aspect of whatever's wrong
[22:47:50] what server is vega using?
[22:48:10] I wonder how this could happen
[22:48:20] apache2 2.4.25
[22:48:20] if I repeatedly do a curl-to-self locally on vega for http://bienvenida.wikimedia.org/ (no extra headers for HTTPS), it randomly answers with either the site content, or an HTTPS 301 redirect...
[22:49:43] could the 301 be an artifact of the STS header?
[22:51:15] mutante: the reimage script uses the apache-fast test via cumin, just connecting to the deployment host and running it from there
[22:51:20] there is no STS header
[22:51:30] static-bugzilla is the only vhost using RewriteCond and RewriteRule for %{HTTP:X-Forwarded-Proto} !https
[22:51:46] i am tempted to just lower the priority of that
[22:51:55] or if you need to run it across many hosts, see the warmup steps in the cookbooks for the switchdc
[22:52:10] to run specific URLs against the mw cluster
[22:52:25] volans: ok! gtk, thx
[22:52:33] mutante: so yeah, if I add an artificial ~"X-F-P: https", I get random bugzilla-vs-bienvenida
[22:52:44] is priority really how apache determines defaults?
[22:53:16] I think it's supposed to be _default_
[22:53:41] default server static-bugzilla.wikimedia.org
[22:53:53] and the config file starts with 20-
[22:54:00] where?
[22:54:00] all others are 50-
[22:54:11] it says default in the output of apache2ctl -S
[22:54:19] I get the 20-/50- sets the order that includes are processed
[22:55:15] ah
[22:55:22] from apache docs: "The string _default_, which is an alias for *"
[22:55:29] so it's not that, they're all default in some sense (heh)
[22:58:19] bblack: let's just remove that rewrite rule entirely
[23:01:02] well let's figure out what's happening first
[23:01:16] I'm going to disable puppet on vega for a little and experiment...
[23:02:24] ok, i will not touch vega config .. am on bromine though
[23:06:27] mutante: are you editing bromine?
[23:06:49] bblack: no, not in any config file
[23:07:00] oh nevermind, IP address thing
[23:07:38] was reading apache docs and VirtualHost *:80 it's like https://httpd.apache.org/docs/2.4/vhosts/examples.html#purename suggests it
[23:07:55] well
[23:08:13] "The asterisks match all addresses, so the main server serves no requests. Due to the fact that the virtual host with ServerName www.example.com is first in the configuration file, it has the highest priority and can be seen as the default or primary server. "
[23:08:13] the one thing that's missing by the docs, is somewhere in the main config should be a "NameVirtualHost *:80"
[23:08:36] but I don't think it's actually necessary
[23:08:49] i think that became default
[23:08:54] at some point
[23:09:16] heh
[23:09:17] https://stackoverflow.com/questions/34701809/namevirtualhost-has-no-effect-and-will-be-removed-in-the-next-release
[23:09:28] The NameVirtualHost directive no longer has any effect, other than to emit a warning. Any address/port combination appearing in multiple virtual hosts is implicitly treated as a name-based virtual host.
[23:09:30] anyways, I was going to restart with that config and see if it affected anything
[23:09:32] since 2.4
[23:09:46] but I tried just restarting apache2 first, and the problem vaporized
[23:10:14] eh.. you mean.. restarting apache fixed it??
[23:10:26] also, when I was tailing all the logs for this, the misdirected requests would create no log entry anywhere, even though legit requests to both sites were being logged :/
[23:11:02] so far, it seems to have fixed it, yes.
[23:11:08] wow
[23:11:20] so, random fault in apache code or the VM's memory? no idea
[23:11:39] but in general,..do we need those rewrite rules for non-https to https ?
[23:11:42] nowadays
[23:12:08] that's complicated :)
[23:12:24] did all the other virtual hosts just have http->https due to the rules in static-bugzilla ?
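The default-vhost behaviour discussed above (config files included in 20-/50- order, first vhost on *:80 acting as the fallback) can be modelled roughly as follows. This is a simplified sketch, not Apache's actual matching code: the `VHost` class, `select_vhost` function, and `.conf` file names are hypothetical, and wildcard aliases and IP-based matching are ignored.

```python
from dataclasses import dataclass


@dataclass
class VHost:
    conf_file: str       # hypothetical file name; only its sort order matters here
    server_name: str
    aliases: tuple = ()  # exact-match aliases only (no wildcards in this sketch)


def select_vhost(vhosts, host_header):
    """Pick a vhost for a request, assuming all vhosts listen on *:80.

    Assumption: config files are processed in alphabetical order, so a
    "20-" file is seen before any "50-" file. The first vhost seen is the
    fallback when no ServerName or alias matches the Host header.
    """
    ordered = sorted(vhosts, key=lambda v: v.conf_file)
    wanted = (host_header or "").split(":")[0].lower()
    for vhost in ordered:
        names = [vhost.server_name.lower()] + [a.lower() for a in vhost.aliases]
        if wanted in names:
            return vhost
    return ordered[0]


vhosts = [
    VHost("50-bienvenida.conf", "bienvenida.wikimedia.org"),
    VHost("20-static-bugzilla.conf", "static-bugzilla.wikimedia.org"),
]
print(select_vhost(vhosts, "bienvenida.wikimedia.org").server_name)
print(select_vhost(vhosts, "unknown.example.org").server_name)
```

Under this model a request with a correct Host header should never fall through to static-bugzilla, which is consistent with the conclusion in the log that the include ordering alone does not explain the misrouting.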
[23:12:31] heh, ok
[23:12:42] right now they're doing nothing in this case, because it's sitting on a private network and only varnish is accessing it, which enforces the user-facing HTTPS, and only uses HTTP to contact apache
[23:13:02] but eventually, in some future world, our backend caches will be ATS and will be making HTTPS connections to all internal services
[23:13:22] at which point we'll want to enable+enforce internal HTTPS using internal certs, on all these small private services on the inside
[23:13:38] and you'll want these redirects again
[23:13:47] aha! ack
[23:13:54] (or even, just drop port 80 entirely, for internal IPs)
[23:14:12] for now I'd say not worth messing with the status quo, till we run through all such configs later to make them HTTPS-only
[23:14:18] that sounds almost better.. like fewer things that can break with the rewrites
[23:14:23] yes
[23:14:49] so we assume that only vega was serving bad answers
[23:14:53] and bromine never did
[23:16:29] and i do not get served by codfw even though i am on the west coast
[23:16:35] and both backends are enabled
[23:16:41] and that's why i never saw this myself
[23:18:17] ok
[23:18:41] it's frustrating not to have a real answer. I wouldn't expect apache to be randomly-faulty and fix itself on a restart.
[23:20:04] it could be something like this: https://serverfault.com/questions/238400/apache-routes-traffic-to-wrong-virtualhost-the-first-time-it-is-started
[23:20:38] where the first start of apache2 after the VM boots happens too fast, and the parsing of the *:80 races with the definition of the virtual network interfaces and causes misbehavior. there's also fixes on a later restart.
[23:20:54] s/there's/theirs/ heh
[23:22:43] or maybe something related where our current config requires a restart after defining new virtualhosts (since I assume this one was recent)
[23:23:17] yea, it was fairly recent
[23:24:16] or, maybe puppet did restart/reload apache, but somehow the soft restart still had 1x process leftover still running and trying to shut down from the old config, which didn't recognize the new hostname?
[23:24:28] (and competing for port 80 traffic)
[23:24:35] i am checking my own bash_history to see if i restarted it
[23:25:11] that process could've been stalled on a long-lived persistent conn from varnish, too
[23:25:40] in any case, I'm out of the hosts and they should be in normal state, and no more repro
[23:26:13] while staring at this, I noticed a few somewhat-unrelated things about cache_misc and cookies and caching
[23:26:17] bblack: thank you! it would have probably driven me crazy before i did the restart :p
[23:26:28] and the timing ...
[23:26:38] 1) these microsites don't emit any cacheability headers, so they're getting the default ~24h TTLs, I guess that's ok?
[23:27:03] that is normally ok, yes. except maybe on the day people go live
[23:27:20] well we can always wipe the cache for a given hostname if we need to
[23:28:08] 2) We have some cookie-parsing rules for the alternate_domains VCL (the artist formerly known as cache_misc), because otherwise a bunch of common non-session cookies get seen as "omg cookie, disable all caching"
[23:28:30] where we exclude our own GeoIP cookie, the common global _ga cookies from Google, etc
[23:28:47] but we're not excluding our own piwik cookies that sites like bienvenida use
[23:29:55] so it commonly becomes accidentally uncacheable after the first page load
[23:30:40] because UAs are sending the piwik cookies "_pk_ses.X.Y=*; _pk_id.X.Y=", where X and Y vary and are some hex ids
[23:30:43] bblack: so regarding 24h TTL.. i think now is such a case and we want the purge
[23:31:01] has there been a recent update?
[23:31:03] because atgomes is now seeing the cached version of the wrong site.. i think
[23:31:10] atgomez
[23:31:12] yeah, i still have the bugzilla site
[23:31:22] are you sure it's not browser-cached?
[23:31:25] i've cleared it
[23:31:30] could be
[23:31:35] will try another browser as well
[23:32:11] yeah i'm seeing it across browsers after clearing caches
[23:32:37] my other thought was office wifi might be contributing? i know basically nothing about how this stuff is configured, but sometimes office wifi does weird things
[23:33:20] how about something like https://bienvenida.wikimedia.org/?12345
[23:33:46] that one works
[23:34:05] just checked the plain one on another machine and got the same thing
[23:34:08] (the bugzilla)
[23:34:29] the office wifi can't interfere with this
[23:34:59] bblack good to know… there goes my theory
[23:36:30] atgomez: try now
[23:36:48] (you may have to clear browser again, but it should be fixed)
[23:37:27] bblack: lgtm!
[23:37:32] THANK YOU
[23:37:37] :)
[23:38:07] np
[23:38:21] mutante: so I did a manual varnishadm ban on the whole domainname, to cache_text
[23:38:32] mutante: I think that's documented somewhere as a workaround for these kinds of things
[23:38:51] lol, the docs reference "salt" of course
[23:38:55] should we add something like "Header set Cache-Control " to lower the 24 hours then
[23:39:04] heh, i think i remember those docs, ack
[23:39:56] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_(bans)
[23:40:20] that documentation is very meandering and awful, and also is mostly salt-based
[23:40:28] it does have some cumin examples at the bottom though
[23:40:51] i see them.. ok.. so that minus the cluster part
[23:41:12] and be careful to now purge more than intended
[23:41:15] not
[23:41:26] for a low-traffic situation like this, I didn't even try to time out the datacenters, I just did 2 runs: backend then frontend, like this:
[23:41:39] bblack@neodymium:~$ sudo cumin 'A:cp-text' 'varnishadm ban "req.http.host ~ bienvenida.wikimedia.org"'
[23:41:50] sudo cumin 'A:cp-text' 'varnishadm -n frontend ban "req.http.host ~ bienvenida.wikimedia.org"'
[23:42:21] saving that too
[23:42:27] (and yeah the dots should be escaped in that regex, or it should just be an ==, but I was being lazy and the net effect is the same either way)
[23:44:12] heh, *nod*. so if we wanted to change content more often than 24, i would copy existing config like:
[23:44:29] Header set Cache-Control "max-age=14400 must-revalidate"
[23:45:05] that is from OTRS, fwiw
[23:45:36] that's wrong anyways heh
[23:46:30] mainly what's wrong is that fields should be comma-separated, e.g. "max-age=14400, must-revalidate"
[23:47:07] ok, i found the one that was wrong :)
[23:47:08] hieradata/labs/deployment-prep/common.yaml: 'Cache-Control': 'public, s-maxage=360, max-age=360'
[23:47:08] I'm iffy on must-revalidate, but since we're not adding magic to suppress these headers from browsers, I guess leave it in
[23:47:39] the public part is mostly irrelevant for most of our cases
[23:48:19] re: the max-age, there's tradeoffs, and we don't like short values for high-volume sites
[23:48:31] but for a low-volume static site, you could maybe compromise at an hour
[23:48:45] "max-age=3600, must-revalidate"
[23:49:02] but with fewer typos
[23:49:24] oh that is right, it just looked funny at this point, I've typed "revalidate" too many times
[23:49:28] it's like saying a word over and over
[23:49:49] atgomez: one hour sounds good, right
[23:50:23] that would be the maximum time people still see an old version after you make updates
[23:50:28] mutante: yeah, that would be swell
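Two corrections from the end of the log can be illustrated with a small sketch: why the dots in the ban regex should be escaped, and why Cache-Control directives need commas between them. This is only an illustration under stated assumptions. Python's `re` module stands in for the regex engine behind Varnish's `~` operator, and every function name here is hypothetical, not part of Varnish, Apache, or cumin.

```python
import re

HOST = "bienvenida.wikimedia.org"


def loose_ban_matches(host_header):
    """Unescaped, unanchored match, like `req.http.host ~ bienvenida.wikimedia.org`."""
    return re.search(HOST, host_header) is not None


def strict_ban_matches(host_header):
    """Escaped and anchored match, behaving like an `==` comparison."""
    return re.fullmatch(re.escape(HOST), host_header) is not None


def format_cache_control(max_age, must_revalidate=False):
    """Join directives with ", " as the header syntax expects."""
    directives = [f"max-age={int(max_age)}"]
    if must_revalidate:
        directives.append("must-revalidate")
    return ", ".join(directives)


def parse_cache_control(value):
    """Naive comma-splitting parser; enough to show what a missing comma does."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip() if sep else None
    return directives


# An unescaped "." matches any character, so unrelated hosts would be banned too.
print(loose_ban_matches("bienvenidaXwikimedia.org"), strict_ban_matches("bienvenidaXwikimedia.org"))

# Without the comma, must-revalidate is swallowed into the max-age value.
print(parse_cache_control("max-age=14400 must-revalidate"))
print(parse_cache_control(format_cache_control(3600, must_revalidate=True)))
```

In this case the loose ban had the same net effect because no other hostname in the cache resembled the pattern, as noted in the log. The strict form matters more when hostnames share prefixes or suffixes.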