[01:55:11] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (DStrine)
[06:29:59] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops, FR-Email: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (CCogdill_WMF)
[10:33:30] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Daimona)
[11:47:24] Traffic, Analytics, Operations: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema)
[11:47:49] Traffic, Operations, Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (ema)
[11:47:53] Traffic, Analytics, Operations: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema)
[11:48:22] Traffic, Analytics, Operations: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) p:Triage→Medium
[11:49:46] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Elitre)
[11:50:47] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Elitre)
[12:56:42] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (fgiunchedi) p:Triage→Medium
[12:56:58] Traffic, Analytics, Operations: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) When it comes to varnish-modules, our current version (0.12.1-1+wmf2) does not build against 6.0.x, and same goes for varnish-modules 0.16.0 currently in testing. Luckily though, with a few changes...
[15:09:29] Traffic, Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (CKoerner_WMF) Open→Resolved a:CKoerner_WMF Thanks @Nintendofan885 for the reminder. Resolved!
[16:20:25] Traffic, Maps, Operations, Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (TOR) If this is not fixed today (within the next couple hours) we will be forced to use the English-language interface at...
[16:23:41] btw, I'm going to fix that, they are an affiliate, listed on metawiki, which is what I'm choosing as the bright line for the time being
[16:25:26] cdanis: ack, sounds good
[16:25:35] (and thanks!)
[16:29:02] bblack: do you have a moment to glance at https://gerrit.wikimedia.org/r/c/operations/puppet/+/623416/ ?
[16:30:26] yeah
[16:30:31] Traffic, Analytics-Radar, Operations, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (Milimetric)
[16:33:28] lgtm!
[16:33:36] ty!
[16:35:20] Traffic, Maps, Operations, Patch-For-Review, Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (CDanis) Open→Resolved A fix has been merged and should take effect within the next half hou...
[16:43:59] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (CDanis) FTR, in T261506 I added wikimedia.pl to our list of allowed domains. * They're an affiliate, listed on metawiki for some time, whi...
[16:46:21] Traffic, Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (Dzahn) Was this done on blog.wikimedia.org itself without needing a change by SRE?
[16:58:29] Traffic, Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (CKoerner_WMF) Yes, this was handled on the host side of things. Sorry for the noise.
[17:14:14] netops, Operations, ops-eqiad: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (RobH)
[17:14:17] netops, DBA, Operations, ops-eqiad, User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (RobH)
[17:52:37] bblack: by any chance are you around for a gdnsd question?
[17:52:43] yup
[17:53:55] so, these days, both when running tests and during the real services switchover, we noticed that the TTL change check was failing its retries, but re-running it worked fine. All evidence points in the direction that we need more time to converge to the new data
[17:54:11] but this doesn't seem to happen for the pooled/depooled change
[17:54:38] so I was wondering if, when reading the confd-generated files, gdnsd behaves differently when changing the UP/DOWN state vs changing the TTL value
[17:55:00] that might not be reflected immediately when making dns queries
[17:55:01] I don't think it does, no
[17:55:10] is it possible the check is hitting the caches?
[17:56:38] it shouldn't but let's recheck it, also changing the TTL from 300 to 10 the checks succeed in <<300s, more like 20, depending on how many records are changed together
[17:56:59] right, but caches are weird, they have many threads with independent cache contents, et
[17:57:02] the other question is if gdnsd has to reload the whole daemon for TTL changes or something that might take 1~2s and does that per record
[17:57:02] c
[17:57:38] can you point at the code that makes the TTL change and checks it? that might be easier to figure it out from
[17:57:53] (are these admin_state TTL changes, or zonefile TTL changes?)
[17:57:59] sure, we query for 'A:dns-auth' and then call port 5353
[17:58:12] setup is at
[17:58:13] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/dnsdisc.py#49
[17:58:33] and check_ttl at line 141
[17:58:46] ok
[17:58:52] to keep you up to date, we first had an issue with confd
[17:59:16] that was solved by j.oe changing watch to poll every 3s (let me find the phab post with more info)
[17:59:37] but now from syslog the confd files seem to be updated correctly and promptly
[18:00:06] context on confd issue: https://phabricator.wikimedia.org/T260889#6411144
[18:00:51] the "problem" with the admin_state mechanism in general is that it's not synchronous
[18:01:18] it should be fast, but there's no deterministic way to wait on it taking effect
[18:01:29] (other than maybe tailing syslog outputs)
[18:02:19] when was the change in question?
[18:02:53] e.g. authdns1001 has this in syslog from ema downing eqiad front edge:
[18:02:56] Aug 31 14:58:59 authdns1001 gdnsd[10289]: admin_state: (re-)loading state file '/var/lib/gdnsd/admin_state'...
[18:02:59] Aug 31 14:58:59 authdns1001 gdnsd[10289]: admin_state: state of 'geoip/generic-map/eqiad' forced to DOWN/MAX, real state is NA
[18:03:00] like today around 14:11
[18:03:02] Aug 31 14:58:59 authdns1001 gdnsd[10289]: admin_state: load complete
[18:03:11] and then 14:13 for the second round
[18:03:42] ok I see
[18:03:47] those are all dns discovery records, those like /var/lib/gdnsd/discovery-zotero.state
[18:03:54] that I fail to see in syslog
[18:04:28] on the gdnsd side, I just see the confd side
[18:05:38] empirically I can say that changing more records at the same time needs more time to get them propagated
[18:05:46] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (bcampbell) Vendor says it looks to be all correct now. They shared this link: https://federationte...
[18:05:54] right
[18:06:23] the other changes circa 14:20, you can see the gdnsd reload of them
[18:06:32] Aug 31 14:20:39 authdns1001 gdnsd[10289]: state of '10.2.1.55/discovery-state-api-gateway' changed from DOWN/10 to UP/10
[18:06:39] but not the earlier ones at 14:11 or 14:13
[18:06:53] really I don't see 14:13 even from confd on authdns1001
[18:07:02] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) Open→Resolved a:CDanis
[18:07:17] that's because we re-ran the cookbook, which is idempotent
[18:07:19] so it was the same change
[18:07:31] and no confd file was changed, basically it's a trick to re-run the tests
[18:07:39] and ensure that all matched
[18:07:42] and it did succeed
[18:08:11] sorry, I didn't explain myself correctly earlier: the confd changes were only at 14:11
[18:08:28] Aug 31 14:11:40 authdns1001 confd[30865]: 2020-08-31T14:11:40Z authdns1001 /usr/bin/confd[30865]: INFO /var/lib/gdnsd/discovery-zotero.state has md5sum 8b7b209a9d2aad9c980a950e414331d8 should be 79147d8db4f272e13c6be99feb02d024
[18:08:31] Aug 31 14:11:40 authdns1001 confd[30865]: 2020-08-31T14:11:40Z authdns1001 /usr/bin/confd[30865]: INFO Target config /var/lib/gdnsd/discovery-zotero.state out of sync
[18:08:34] Aug 31 14:11:40 authdns1001 confd[30865]: 2020-08-31T14:11:40Z authdns1001 /usr/bin/confd[30865]: INFO Target config /var/lib/gdnsd/discovery-zotero.state has been updated
[18:08:38] so it does claim it really changed the file contents
[18:08:44] but yeah, no message from gdnsd about the change being noticed
[18:08:53] and then later the check succeeded around 14:13?
[18:09:06] yes
[18:09:15] I can have exact times if you need them from the cookbook logs
[18:09:28] nah
[18:10:43] in that specific case the cookbook failed at
[18:10:49] 2020-08-31 14:12:08,407 rzl 8957 [DEBUG dnsdisc.py:209 in resolve] [authdns2001.wikimedia.org] search -> 10.2.2.30 TTL 300
[18:10:59] was expecting to find 10 as TTL
[18:12:01] volans: is this because of the new 'unified' DNS setup? with a recursor in front of the authdns now?
[18:12:01] mmmh interesting
[18:12:13] cdanis: no we query the auth part directly
[18:12:16] ah ok
[18:12:32] bblack: it seems that the confs change on that record was right before the failure
[18:13:05] I'm wondering if it's an etcd delay... I'll dig on that side too
[18:13:44] but weirdly it didn't happen for the pooled/depooled state so far
[18:14:37] so the log output missing is apparently normal. for whatever reason, the ghost of brandon past decided to only emit a log message if the down-ness changed, not the ttl
[18:15:03] I assumed so yeah :)
[18:15:05] but that doesn't seem to affect applying the data itself, just the message
[18:15:28] weird I can see confd changing the search file at 2020-08-31T14:11:59Z and 2020-08-31T14:12:07Z
[18:15:57] ack
[18:16:06] would a full second delay be enough to mess up your check?
[18:16:08] thx for checking and confirming there is no difference in behavior
[18:16:47] when any of these updates are applied (state or TTL), if there's a rapid-fire series of changes, the code kicks in a 1-second queue to batch up the changes
[18:17:20] basically the first change from a long-idle state happens ASAP, then as long as you keep spamming rapid updates, it will apply whatever has changed so far in batches once per second
[18:17:33] (all of these discovery file changes, etc)
[18:17:44] ok (I was sure there was a good design behind ;) )
[18:18:00] and that's async so you could well win the race with it
[18:19:07] I had to look at the code to even remember that was there
[18:19:11] ahahaha
[18:19:47] the gdnsd of the future won't have this stuff
[18:19:54] I think we can exclude gdnsd for the moment, some logs from confd don't make sense to me right now
[18:20:26] at some point in the running design updates doc, I decided that all asynchronous/automatic things are the devil, and every kind of change should be explicitly-triggered and synchronous from the requesting pov
[18:20:45] but who knows when I'll get around to working on future-gdnsd again
[18:21:31] :)
[18:21:32] even gdnsd-3.x has undone some of the worst of the past async/automatic bullshit, but not all of it
[18:22:32] (admin_state, these discovery files, and geoip database updates are the three big cases I know of that are still like that)
[19:18:04] Traffic, Operations, conftool, serviceops, Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Volans) After today's failure of the `check_ttl` step in the switchdc of the services, I had a ch...
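(editor's note) Since the discussion above concludes that these state/TTL updates are applied asynchronously, possibly batched about once per second, with no deterministic way to wait on them, a check on the client side can only poll until a deadline. Below is a minimal Python sketch of that idea. It is not the spicerack `check_ttl` code and not gdnsd's API: the function name `wait_for_ttl`, its parameters, and the injected `lookup` callable (which would wrap whatever query is made against the authoritative servers) are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_for_ttl(
    lookup: Callable[[], int],
    expected_ttl: int,
    timeout: float = 10.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll lookup() until it reports expected_ttl or the timeout elapses.

    lookup is a hypothetical callable returning the TTL currently served
    for one record. Returns the number of attempts that were needed.
    Raises TimeoutError if the expected TTL never shows up in time.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        ttl = lookup()
        if ttl == expected_ttl:
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                f"TTL is still {ttl}, expected {expected_ttl} "
                f"after {attempts} attempts"
            )
        # Give the asynchronous update a chance to land before retrying.
        sleep(interval)


if __name__ == "__main__":
    # Simulated server: answers with the old TTL twice, then the new one.
    answers = iter([300, 300, 10])
    tries = wait_for_ttl(lambda: next(answers), expected_ttl=10, interval=0.1)
    print(f"converged after {tries} attempts")
```

Passing `clock` and `sleep` in as parameters is only there so the helper can be exercised without real waiting; with a batching delay on the order of a second, a timeout of a few seconds per record would be a reasonable starting assumption, to be tuned against real behaviour.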
[19:18:32] bblack: my findings so far if you're curious ^^
[19:18:57] ( T260889#6424881 if you ignore wikibugs )
[19:18:57] T260889: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889
[20:33:34] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (JMinor) @CDanis Yes, I will make a subtask for tracking these and any other affiliated domains that need an exemption. We're also send...
[20:34:29] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (JMinor) To this specifically: > Someone recently mention to me "the only way to prevent WMF people from doing making stupid mistakes thes...
[21:03:56] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Support maps serving for affiliate sites via an allow list. - https://phabricator.wikimedia.org/T261694 (JMinor)
[21:04:22] Traffic, Maps, Operations, Product-Infrastructure-Team-Backlog: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (JMinor)
[22:04:15] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (DStrine) Open→Declined
[22:46:12] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (Urbanecm) Just out of curiosity, why was this declined @DStrine?
[23:02:56] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (DStrine) Declined→Open
[23:04:18] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (DStrine) This is done from my perspective. I'll open it if others are still using it. We...