[06:50:25] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2008.codfw.wmnet ` The log can be found in... [07:19:09] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2008.codfw.wmnet'] ` and were **ALL** successful. [07:39:07] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul I had to upgrade the NIC FW on lvs2008 `name=before vgutierrez@lvs2008:~$ sudo -i ethtool -i ens2f0np0 driver: bnxt_en version: 1.9.2 firmware-version: 20.6.... [10:36:47] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs5002.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [11:17:43] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs5002.eqsin.wmnet'] ` and were **ALL** successful. [11:54:58] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [13:48:34] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs5001.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[14:04:29] 10Acme-chief, 10Traffic, 10Operations, 10Patch-For-Review: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10Vgutierrez) ` $ openssl s_client -connect upload-lb.ulsfo.wikimedia.org:443 2>&1 < /dev/null |openssl x509... [14:26:26] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Vgutierrez) [14:30:08] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs5001.eqsin.wmnet'] ` and were **ALL** successful. [14:37:22] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2005.codfw.wmnet` - lvs2005.codfw.wmnet (**PASS**) - Downt... [14:44:20] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [14:53:03] hey, I have a patch to add an lvs service that I'd like to promote to "lvs_setup" today https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/575631/ [14:53:38] the docs at https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers say in 3 ways to check in here first, so here I am :) [14:54:02] does that patch look ok? and in terms of timing will today work? 
[14:58:05] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Vgutierrez) [14:59:46] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:03:52] herron: looks good IMHO [15:04:22] servers are pooled in both DCs.. IPs seem to be ok... [15:05:06] vgutierrez: awesome thx for looking so quick [15:05:30] so I have an hour now before meetings I could deploy this, if that works? [15:05:57] same here [15:06:04] +1 [15:06:34] excellent, ok could I trouble you for a +1 on the patch? "for the books" [15:07:40] hmmm [15:07:49] I didn't check that the lvs is able to reach port 443 on those logstash instances [15:08:09] ah ok, holding off for that [15:08:53] codfw looks ok [15:09:20] eqiad too :) [15:11:10] beautiful [15:11:15] ok, proceeding! [15:16:13] merged and puppet-merged, running puppet on O:lvs::balancer now [15:19:52] done, next step is pybal restart on low-traffic primary, yes? [15:20:00] vgutierrez: that is lvs1015 and lvs2009? [15:20:14] I'd restart the secondaries first :) [15:20:21] so lvs1016 and lvs2010 [15:20:33] you can break the secondaries without messing with production traffic :) [15:20:36] ah thanks, the doc even says that [15:20:44] that makes way more sense! [15:20:59] usually I stop pybal, sleep 5 secs and start pybal again [15:21:08] to avoid BGP flapping too fast [15:21:18] ok, good to know, will do that [15:21:43] i don't think we have bgp dampening enabled, do we? 
[15:21:48] maybe we do :) [15:21:50] a quick check afterwards for BGP would be sudo -i journalctl -u pybal --since="5min ago" |fgrep -i bgp [15:22:51] ok, doing [15:25:37] done and log looks good to me [15:27:18] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2004.codfw.wmnet` - lvs2004.codfw.wmnet (**PASS**) - Downt... [15:27:21] herron: yup [15:27:32] alright, moving on to primaries [15:27:49] BTW [15:27:52] Mar 02 15:24:14 lvs2010 pybal[16788]: [kibana-ssl_443] INFO: New enabled server logstash2004.codfw.wmnet, weight 0 [15:27:52] Mar 02 15:24:14 lvs2010 pybal[16788]: [kibana-ssl_443] INFO: New enabled server logstash2005.codfw.wmnet, weight 0 [15:27:52] Mar 02 15:24:14 lvs2010 pybal[16788]: [kibana-ssl_443] INFO: New enabled server logstash2006.codfw.wmnet, weight 0 [15:27:58] those weights need to be fixed :) [15:28:17] ah yes [15:29:47] oh I see, that's the prod cluster, ok [15:30:57] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3007.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
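The restart order discussed above (secondaries first, a short pause between stop and start, then a BGP check in the journal) can be sketched as a small plan builder. This is only an illustration: the function name `restart_plan`, the use of `ssh` and the `systemctl` invocations are assumptions, and the host names are simply the ones mentioned in the conversation. The script only prints commands and runs nothing.

```python
from typing import List


def restart_plan(secondaries: List[str], primaries: List[str], pause: int = 5) -> List[str]:
    """Return shell commands in execution order: secondaries before primaries."""
    plan = []
    for host in secondaries + primaries:
        # Stop, wait a few seconds, start again, to avoid flapping BGP too fast.
        # The ssh/systemctl form is an assumption, not taken from the log.
        plan.append(
            f"ssh {host} 'sudo systemctl stop pybal && sleep {pause} "
            f"&& sudo systemctl start pybal'"
        )
        # BGP check quoted in the conversation.
        plan.append(
            f"ssh {host} 'sudo -i journalctl -u pybal --since=\"5min ago\" | fgrep -i bgp'"
        )
    return plan


# Host names as mentioned in the chat; treat the roles as unverified.
plan = restart_plan(["lvs1016", "lvs2010"], ["lvs1015", "lvs2009"])
for command in plan:
    print(command)
```

Keeping the plan as plain strings makes it easy to review the order before anything touches a load balancer.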
[15:31:57] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [15:36:29] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:37:06] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [15:47:38] thank you vgutierrez [15:54:14] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` Of which those **FAILED**: ` ['lvs3007.esams.wmnet'] ` [15:57:58] bblack: when you'll be around I've a question for you regarding gdnsd and encoding, currently deploy-check.py works only on py3.7 with prod data ;) [15:58:39] ok [15:58:42] what's the question? [15:59:08] volans: ^ [15:59:09] we add a comment to the generated files with the serial of the zone, the commit sha1 and the first line of the commit [15:59:25] in 2 files the commit line has non-ascii chars [15:59:30] ok [15:59:48] and when we do Path().write_text() it fails on 3.5/3.6 because of defaulting to locale or ascii [16:00:00] while works on 3.7 because of different defaults [16:00:17] first I wanted to know if this is a prob on gdnsd side [16:00:41] so, it's basically an rfc-defined spec for the basics of the file, and it doesn't account for anything non-ascii. In theory unprintable (from the ascii pov) bytes are just binary data and "work" in some sense. 
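A minimal sketch of the encoding issue described above, assuming the generated comment is written with `pathlib`. Passing an explicit `encoding` to `Path.write_text()` removes the dependency on the interpreter or locale default, and encoding with `errors="replace"` is one possible way to keep the generated file ASCII-only. The helper name `ascii_comment`, the file name and the sample commit line are hypothetical and not taken from deploy-check.py.

```python
import tempfile
from pathlib import Path


def ascii_comment(text: str) -> str:
    """Replace every non-ASCII character with '?' so the result is pure ASCII."""
    return text.encode("ascii", errors="replace").decode("ascii")


# Hypothetical first line of a commit message containing non-ASCII characters.
commit_line = "Add records for caf\u00e9 and na\u00efve hosts"
header = "; " + ascii_comment(commit_line) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    zone_file = Path(tmp) / "example.zone"
    # Explicit encoding: the write no longer depends on locale defaults.
    zone_file.write_text(header, encoding="ascii")
    written = zone_file.read_bytes()

print(header, end="")
```

Replacing with `?` loses information; transliteration or dropping the characters would be equally valid choices for a comment line.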
[16:00:44] and based on that decide what's the best fix [16:01:14] (I think tests even exercise unprintable bytes in data, not just comments) [16:01:20] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [16:01:44] but it isn't a utf-8 file, and it would probably be better to strip/convert/underscore/whatever utf-8 bytes coming through from commit msgs [16:02:36] ack [16:05:03] I'll send a patch later in this sense then [16:09:38] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10Papaul) 05Open→03Resolved Complete [16:09:41] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:10:15] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:11:04] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:11:54] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:22:07] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2007.codfw.wmnet'] ` [16:36:18] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet 
for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [16:36:22] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2007.codfw.wmnet'] ` [17:03:39] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [17:03:55] 10Traffic, 10netops, 10Operations, 10ops-codfw: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) [17:05:58] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) p:05Triage→03Medium We should have a meeting about this towards the end of this quarter / beginning of next.... [17:42:46] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [17:48:01] 10Traffic, 10netops, 10Operations, 10ops-codfw: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) @Vgutierrez looks like re-seating the transceiver fix the problem. you can get the traffic back if we see the error again we will replace the... [18:04:57] 10Traffic, 10netops, 10Operations, 10ops-codfw: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) @Vgutierrez after you got the traffic back the error came back again so we will have to replace the transceiver tomorrow. 
[18:06:29] 10Traffic, 10netops, 10Operations, 10ops-codfw: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) p:05Triage→03Medium [18:07:44] 10Traffic, 10netops, 10Operations, 10ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` and were **ALL** successful. [18:08:20] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [18:08:44] 10Traffic, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) 05Open→03Resolved @Vgutierrez lvs2007 is ready for service. [18:48:35] 10netops, 10Operations: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) a:03faidon Seems like they changed things. From https://www.gtt.net/us-en/support/ "Finding your NOC contacts on the dashboard in EtherVision" @faidon do you have portal access? If so... [18:51:28] 10netops, 10Operations: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) Until then I updated the Netbox page to match their PeeringDB NOC contact: https://www.peeringdb.com/net/14 [19:48:53] vgutierrez, nice. maybe we should do some announcement about using LE certs somewhere? [21:12:49] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10LilyOfTheWest) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description w... 
[21:16:16] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) The internal use cases would be nice to support, and I think we can discuss that separately from how much we tru... [21:17:07] 10netops, 10Operations: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10ayounsi) Note that enabling `graceful-restart` will cause all BGP sessions to flap. [21:19:46] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description with "suf... [21:34:51] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) What is the number of users this potential system would serve? 10/100? [21:41:10] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Nuria can you help me understand in what sense the answer to this question is important? Is it about RAM and Storage... [21:46:54] 10Traffic, 10netops, 10Operations: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10ayounsi) [21:49:54] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) This ask, in terms of infrastructure is a significant one and we would like to e how many users are benefiting from i... 
[22:19:53] 10Traffic, 10Analytics, 10Operations, 10Research, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) >>! In T245833#5934960, @Nuria wrote: > This ask, in terms of infrastructure is a significant one and we would like t... [22:47:58] bblack: if you have a minute chaomodus and I would like to get to a conclusion for https://gerrit.wikimedia.org/r/c/operations/dns/+/569340/5/utils/deploy-check.py#152 and seems it will be quicker to do it here all 3 of us [23:00:31] volans: sure [23:01:34] I'll start with a quick one, any reason to discard the {% include %} option in jinja instead of $INCLUDE? [23:01:56] less tools dependency, simplicity? [23:02:30] do you see us not using jinja at all? [23:02:43] it'd be nice if we could get there someday, not sure [23:03:01] but with the netbox-generated stuff out of the way, the rest may be manageable with simple includes and softlinks [23:03:09] either way, why do something more complex than needs be? [23:03:12] k8s has loops [23:03:20] but yeah out of scope here [23:03:47] if we go $include we are kinda forced to adapt zone_validator too, because otherwise we kinda fly blind [23:03:57] did you read my response already? [23:03:59] the main upside of jinja include is we can have validate-dns in the loop with the new snippets [23:04:00] yes [23:04:34] you can ignore $include pretty trivially (it might already do so, I don't know) [23:05:10] right now zone-validator isn't doing any kind of "complete" validation of a zone anyways, it's just checking what it chooses to check. choosing not to check something that was autogenerated and validated by another tool, doesn't seem awful. 
[23:05:19] yes, but i guess the issue is that it doesn't check the overlap between manual and automatic records then [23:05:29] yes, that's a risk [23:05:43] yes $include is already ignored [23:06:13] the problem is that let's say, we change mgmt.esams [23:06:13] we'll delete all the old ones that netbox is replacing, I guess the issue that remains is any accidental overlap in the future [23:06:17] the test branch of dns which uses an $include validates more or less ok from gdns and sort of from the validator (modulo a bunch of missing records from its perspective) [23:06:26] zone validator will complain that records in esams.wmnet don't have a mgmt record [23:07:25] sorry, I don't follow what you're saying there with the example case [23:07:38] https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/1892/console [23:07:43] W001|MISSING_IP_FOR_NAME_AND_PTR: 1909 [23:07:53] currently in master is at 381 [23:08:18] because zone validator there is losing one side of the records, just the A mgmt ones [23:08:18] because of? [23:08:21] right but of course that might go away if we remove the ptr records for an $INCLUDE also [23:08:24] but has still all the PTRs [23:08:30] why isn't the generated data doing both sides? 
[23:08:38] we can do both at the same time [23:08:50] and then it will complain that hosts in eqiad don't have mgmt records [23:09:07] I didn't test it but I'm pretty confident it will complain :) [23:09:14] well you can approach that one of two ways: [23:09:36] 10Traffic, 10Operations, 10ops-esams: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (10RobH) [23:09:38] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [23:10:00] 10Traffic, 10Operations, 10ops-esams: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (10RobH) Please note this may be fixed by T243167, which I'm doing (as time and esams condition permits.) [23:10:00] 1) You can continue trying to make zone-validator a more-perfect validator of zones (there's tons of missing cases anyways, and data that it ignores), and have it parse $INCLUDE on its own to grab the data in the process [23:10:20] 2) You can stop checking for mgmt records in zone-validator because they're generated now and don't need double-checking what a human messed up. [23:10:38] well or I'm sure there's like 5 other paths [23:10:44] lol [23:11:11] one of which is to transclude the data via jinja so that zone-validator can see it without having to parse $INCLUDE [23:11:14] ok, then the follow up question is, how long will be the transition period in which esams will be autogenerated and all the other DCs not? [23:11:28] because in that transition period we want to still be protected from human errors [23:11:40] why do we have a transition period of partial coverage at all? [23:12:10] to trust the tool? 
[23:13:38] I'll have to pick up this debate later, I really do have to run and am getting angry looks [23:13:59] sure, no prob, thanks for the chat [23:14:08] thanks a lot as always for feedback :) [23:14:28] but in general stepping out a layer, as a parting thought [23:14:45] I just tend to see this all as a fairly simple affair. it's easy to overcomplicated it. [23:14:48] *overcomplicate [23:15:38] if it generates all the data, and it all matches present-day... [23:15:49] then it worked, and it's trustworthy, as much as anything can be [23:16:17] we can debug that on all the data before we deploy any of it into live use [23:16:28] the day we migrate, what for follow up changes? [23:16:35] *sure, [23:16:41] either way you had the same problem for followup changes later [23:17:36] I guess we'll move some part of the validation in netbox reports, to ensure that the data is valid there in the first place [23:18:26] bblack: not really, the validator today could catch that two hosts have the same dns name, I don't think netbox enforces it in any way [23:18:57] so still a human error on netbox could be caught, but sure, we'll go the $INCLUDE way and adapt the validations steps as we need going forwards
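The two risks raised in this discussion (the same DNS name assigned to two hosts, and accidental overlap between hand-written and generated records) can be sketched as simple set checks. This is a hypothetical illustration, not the zone-validator's or Netbox's actual logic: the record tuple layout, the function names and the sample host names are all assumptions, and flagging every repeated A name as a problem is itself a simplification.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical record layout: (name, record type, value).
Record = Tuple[str, str, str]


def duplicate_names(records: Sequence[Record], rtype: str = "A") -> List[str]:
    """Names that appear more than once for the given record type."""
    counts = Counter(name for name, kind, _ in records if kind == rtype)
    return sorted(name for name, seen in counts.items() if seen > 1)


def overlapping_names(manual: Sequence[Record], generated: Sequence[Record]) -> List[str]:
    """Names defined both by hand and by the generator."""
    manual_names = {name for name, _, _ in manual}
    generated_names = {name for name, _, _ in generated}
    return sorted(manual_names & generated_names)


# Made-up sample data for illustration only.
manual = [
    ("host1.example.", "A", "192.0.2.10"),
    ("host2.mgmt.example.", "A", "192.0.2.20"),
]
generated = [
    ("host2.mgmt.example.", "A", "192.0.2.21"),
    ("host3.mgmt.example.", "A", "192.0.2.30"),
    ("host3.mgmt.example.", "A", "192.0.2.31"),
]

print(overlapping_names(manual, generated))
print(duplicate_names(generated))
```

A check like this could run either as a Netbox report or as a validation step after generation; which place is better is exactly the open question in the conversation.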