[00:07:42] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11142051 (Krinkle)
[00:44:44] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11142081 (Krinkle)
[01:13:43] FIRING: HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3070&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[01:18:43] RESOLVED: HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3070&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[03:26:11] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11142215 (DavidBrooks) Another question about response 429, sorry. Is there a range of values I can expect for Retry-After? AWB already retries 30 second...
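Editor's sketch for the Retry-After question above: the log does not say which values the CDN sends with a 429, so a client probably has to accept both forms the HTTP spec allows (delay in seconds, or an HTTP date) and clamp the result. The function name `retry_delay`, the 30-second fallback, and the 300-second cap are hypothetical illustration values, not anything AWB or the Traffic team has confirmed.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(header_value, default=30.0, cap=300.0, now=None):
    """Return how many seconds to wait, given a Retry-After header value.

    default and cap are assumed values for illustration only.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay-seconds, e.g. "Retry-After: 120"
        delay = float(value)
    else:
        # Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # unparseable header: fall back to the default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return max(0.0, min(delay, cap))
```

A caller would sleep for `retry_delay(response_headers.get("Retry-After"))` seconds before retrying; the cap keeps a very large or malformed value from stalling the client indefinitely.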
[08:08:10] netops, Ganeti, Infrastructure-Foundations: magru: move sandbox vlan to routed Ganeti - https://phabricator.wikimedia.org/T402372#11142484 (ayounsi) Open→Resolved
[08:08:21] netops, Ganeti, Infrastructure-Foundations: magru: move sandbox vlan to routed Ganeti - https://phabricator.wikimedia.org/T402372#11142489 (ayounsi) a:ayounsi
[08:11:01] netops, Ganeti, Infrastructure-Foundations: esams: move sandbox vlan to routed Ganeti - https://phabricator.wikimedia.org/T403580 (ayounsi) NEW p:Triage→Low
[08:32:41] netops, Ganeti, Infrastructure-Foundations, Patch-For-Review: esams: move sandbox vlan to routed Ganeti - https://phabricator.wikimedia.org/T403580#11142603 (ayounsi)
[08:57:54] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11142724 (elukey) I am testing the new provision script on other cp2xxx hosts, and I always end up with the following diff when checking if the settings have...
[08:59:57] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11142742 (elukey) Summary of the status: * Test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1173335 to have a final version of the cookbook that...
[09:41:41] netops, Ganeti, Infrastructure-Foundations: esams: move sandbox vlan to routed Ganeti - https://phabricator.wikimedia.org/T403580#11142999 (ayounsi) @MoritzMuehlenhoff I tried to create the VM using `sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 2 --disk 50 --network sandbox --os none --cluster...
[10:25:22] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11143218 (Vgutierrez) I think we have 3 upcoming DAGs: * the one covered by this task * probenet data for the GeoDNS pipeline (...
[11:00:05] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11143345 (brouberol) IMHO piggybacking on `platform_eng` might allow you to get a faster head start. If you find yourself in a...
[11:01:10] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11143351 (Vgutierrez) That's great, thanks @brouberol
[13:15:20] vgutierrez: hello! is it expected to push that much traffic between eqiad and esams ? https://librenms.wikimedia.org/device/device=176/tab=port/port=19111/ we're getting close to saturating our 10G link
[13:29:52] * vgutierrez looking
[13:33:30] XioNoX: it looks like we have some increased usage in the last two days of August: https://grafana.wikimedia.org/goto/RXtUBdrNR?orgId=1
[13:35:47] thx, looks like it stopped since, your graph is more clear
[13:36:30] and correlates with https://grafana.wikimedia.org/goto/lSH3fd9Ng?orgId=1
[14:04:02] Hi, does the edge (haproxy, varnish?) generate and set X-Request-Id?
[14:04:38] and/or can it for all requests? And/or can it for pass-thru requests?
[14:05:13] ottomata: at present it does not, right now the earliest that gets set is in Envoy
[14:05:44] can it?
context: we'd like to correlate webrequest logs with events
[14:11:47] cdanis: (sorry for ping, am in meeting right now and it would be helpful to know :D )
[14:12:25] maybe
[14:12:26] :)
[14:16:53] hah perfect :D
[14:22:59] ha dan linked me: T244545
[14:22:59] T244545: Add x-request-id to httpd (apache) logs - https://phabricator.wikimedia.org/T244545
[14:23:05] "Please note this will only be useful if we start generating a request id at the caching layer too"
[14:32:38] ah ha
[14:32:39] https://phabricator.wikimedia.org/T221976
[14:34:39] nowadays it should be handled by HAProxy IMHO
[14:37:11] sukhe: can you take care of deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180577/20 and previous CRs in that branch?
[14:37:25] /cc Krinkle
[14:37:52] yes, happy to. I am in a meeting for the next 1.5 hours but after that yes.
[14:38:04] Krinkle: let me know if that works (or not)
[14:38:04] ack
[14:38:14] ack
[14:38:15] basically the same problem here.. starting meetings soon :)
[14:38:42] The Goldilocks zone hour of cross-continental meetings.
[14:44:35] vgutierrez: i'll edit task description
[15:08:33] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144263 (Papaul) For reference please see case number below ` Dear Papaul , Thank you for contacting Dell Technologies technical support, from this moment...
[15:50:31] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11144499 (Papaul) Manually disabling OSPF (using commit confirmed) makes the mgmt go down when done on mr1 or cr3/cr4. But mr1-ulsfo.oob.wikimedia.org and mr1 loopback stil...
[15:52:22] Krinkle: does 16:30 UTC work for the deploy?
[15:52:44] that's ~38 mins from now
[15:53:45] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144512 (Papaul) Here is the email from Dell ` Dear Papaul, Thank you for your patience while we investigated the issue regarding the Virtualization Techno...
[15:54:36] sukhe: yes
[15:58:48] thanks, will ping you then
[16:23:47] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144628 (RobH)
[16:27:35] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144668 (RobH)
[16:30:42] Krinkle: ready to start.
[16:30:46] which one are we doing first?
[16:32:36] The test patch I suppose, bottom up, 3 total.
[16:32:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180969
[16:33:10] cool. I guess I will do one by one
[16:33:19] that would be my preference unless you tell me otherwise
[16:35:18] starting with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180969
[16:40:04] Krinkle: I am guessing there isn't anything really to test in this one?
[16:40:08] nope
[16:40:25] merging on one and proceeding
[16:48:07] merged 60s cap
[16:48:22] now merging the last one https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180577
[16:49:01] rebasing
[16:54:09] Krinkle: looks good. which cp host are you hitting? in case you want to test the direct routing for mobile views change
[16:55:02] cp3069 it looks like
[16:57:54] applied on cp3069
[16:58:01] I will wait for you to test before rolling it out CDN-wide
[16:58:25] ok, checking
[16:59:51] sukhe: LGTM
[17:00:13] cool, thanks.
rolling it out everywhere else, -b31
[17:06:26] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144864 (RobH)
[17:10:15] Krinkle: all done
[17:10:21] that's all for today I think?
[17:11:17] Yeah, I'll refactor to use puppet variable and then continue tomorrow with the other phase 1 wikis.
[17:11:41] We can do another deploy tomorrow?
[17:12:03] yes Thursday should be fine
[17:12:34] feel free to ping us here for reviews/deploys when you want tomorrow
[17:35:45] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144994 (RobH) cp2045 has had the idrac, bios, and SSD firmware updated to latest revisions to match cp2043. Please note that these have been brought online...
[17:37:05] Traffic, DNS, SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11144999 (ssingh) a:CDobbins
[17:48:38] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11145066 (ssingh) >>! In T392851#11144512, @Papaul wrote: > Here is the email from Dell > ` > Dear Papaul, > > Thank you for your patience while we investig...
[17:54:18] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://phabricator.wikimedia.org/T403616#11145087 (ssingh)
[18:00:28] Varnish, Data-Engineering, Data-Engineering-Icebox, SRE, and 2 others: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817#11145138 (Ottomata)
[18:08:48] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11145174 (nshahquinn-wmf)
[18:38:47] Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, SRE: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145300 (Ejegg) Hi @ssingh , I'd be hesitant to start using a different cookie for CentralNotice - we do sometimes ne...
[19:00:21] Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, SRE: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145362 (ssingh) >>! In T122097#11145300, @Ejegg wrote: > Hi @ssingh , I'd be hesitant to start using a different coo...
[19:48:53] Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, SRE: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145538 (Ejegg) So as long as we switch CentralNotice to use the new cookie at the same time analytics switches the W...
[20:17:42] @vgutierrez
[20:18:16] I'm wondering if I can move https://phabricator.wikimedia.org/T221976 out of the icebox?
[20:18:58] I'm going to be bold and maybe we can talk on there about it.
It's not something we need urgently but it's come up in a couple of different places that it would be useful to correlate some ID that's unique to a webrequest with some instrumentation data
[20:19:45] milimetric: I'm interested in that from the distributed tracing side as well :)
[20:27:30] Traffic, SRE: Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11145677 (Milimetric) I'm being bold and removing this from the Icebox so we can talk about it. I've heard a few different folks talk about related needs, but I'll just de...
[20:34:20] thanks for the comments and the links to relevant tasks, @cdanis, happy to set up a meeting if that makes it easier but I'll wait a bit and see what gets discussed on-task
[20:34:45] yeah np, we already have one set up next week anyway re: SDS1.3.1
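
Editor's sketch of the idea discussed around T221976 and T113817: if the edge stamps each incoming request with an ID, webrequest log records and instrumentation events that carry the same ID can be joined afterwards. This is only an illustration under stated assumptions, not how HAProxy, Varnish, or the Hadoop pipeline implement it. The function names, the `request_id` field name, the record shapes, and the choice to keep an already-present header are all hypothetical.

```python
import uuid
from collections import defaultdict


def ensure_request_id(headers):
    """Return a copy of the headers with X-Request-Id set.

    Assumption: an ID that is already present is kept. Whether an edge
    should trust a client-supplied value is a policy choice this sketch
    does not settle.
    """
    out = dict(headers)
    if not any(name.lower() == "x-request-id" for name in out):
        out["X-Request-Id"] = str(uuid.uuid4())
    return out


def correlate(webrequests, events):
    """Pair each webrequest record with the events sharing its request_id."""
    by_id = defaultdict(list)
    for event in events:
        by_id[event["request_id"]].append(event)
    # .get() avoids creating empty entries for requests without events.
    return [(req, by_id.get(req["request_id"], [])) for req in webrequests]
```

With records shaped like `{"request_id": "a", "uri": "/x"}` and `{"request_id": "a", "name": "click"}`, `correlate` returns one `(request, [events])` pair per webrequest, with an empty list for requests that produced no events.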