[07:14:36] not sure if it's a known issue already, but cp3033 (currently depooled) lost it's network connectivity about 19hrs ago
[09:46:53] Traffic, Operations, Patch-For-Review: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (Vgutierrez) After merging change 445611, we have the following offenders at the top 10: ```$ tshark -r dns.pcap -Y "dns.flags == 0x8005" -Tfields -e dns.qry.na...
[10:15:52] Traffic, Operations, Patch-For-Review: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (Vgutierrez) refused queries dropped significantly after merging change 445611 as well. {F23795364} I guess that we should keep and eye on this recurrently
[10:17:46] Traffic, Operations: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (Vgutierrez) Open>Resolved a:Vgutierrez
[10:46:06] ugh, cp5001 is also down since yesterday
[10:46:15] vgutierrez: known ^ ?
[10:46:47] * vgutierrez checking
[10:47:17] the servers miss ema
[10:50:10] cp5001 is reachable via mgmt console
[10:50:20] only thing that the console shows is "Startin"
[10:54:26] no getty no anything?
[10:54:55] nope
[11:02:40] !log Power cycling cp5001 to attempt recovering it
[11:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:26] is cp5001 depooled?
[11:03:49] just to be sure that if it comes up complaining not traffic will be sent to it :)
[11:06:17] elukey: done, thx :)
[11:06:26] <3
[11:06:32] so.. it booted as expected
[11:06:35] vgutierrez: not sure if you my pointer in backscroll earlier the morning; cp3033 lost network connectivity
[11:06:58] I'll check it as well :)
[11:07:06] ack :-)
[11:10:37] it looks like cp5001 has some memory issues
[11:14:19] [OT: check ocsp stapling before repooling ;) ]
[11:15:00] volans: I'm not going to repool it..
[11:15:17] at least not yet with the kernel complaining about one memory DIMM
[11:15:30] ack
[11:17:11] vgutierrez: System event log also reports B4 as broken, this will need to be replaced
[11:26:48] on cp3033 something looks funny about the NIC
[11:30:19] Traffic, Operations, ops-esams: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (Vgutierrez) ```root@cp3033:/var/log# ethtool -i eth0 driver: bnx2x version: 1.712.30-0 firmware-version: FFV7.10.17 bc 7.10.11 bus-info: 0000:01:00.0 supports-statistics: yes sup...
[11:33:42] we lost network link @ cp3033 after [10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 6 timed out
[11:49:53] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (Vgutierrez) both kernel and server event log shows issues on DIMM B4: ``` 3 | 07/14/2018 | 17:49:17 | Memory ECC Uncorr Err | Uncorrectable ECC (UnCorrectable ECC | DIMMB4) | A...
[11:52:29] Traffic, Operations, ops-esams: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (Vgutierrez) After a power cycle the server it's behaving properly. Since it was already depooled I'm not repooling it
[13:56:19] Traffic, Operations, ops-esams: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (Vgutierrez) p:Triage>Normal
[14:39:39] yeah so with 5001 offline we're now down to 4 servers for upload@eqsin, since 5006 is in the same pool and hasn't yet been fixed from initial problems when installed.
[14:40:33] which is basically our design limit: having 2x offline longer-term waiting on repairs is the edge of acceptability. if another goes, we'll just have to depool the cluster for that site.
[14:40:44] ack
[14:42:01] with only 4 available, the routine backend restarts that pull 1/N from the pool temporarily (the cron ones as well as any manual maintenance for code deploys, etc) make 25% of the content rehash to a new node temporarily, which is a ton of churn.
[14:42:13] we'll see how it goes, since the design limit is a guestimate based on history :)
[15:30:40] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (ayounsi)
[15:30:42] Traffic, Operations, ops-eqiad: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ayounsi)
[15:33:42] Traffic, Operations, Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Vgutierrez)
[15:33:58] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (BBlack) p:Normal>High Turning priority to "high" for this and the 5006 ticket, as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.
[15:34:24] Traffic, Operations, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (Vgutierrez)
[15:34:29] Traffic, Operations, Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Vgutierrez)
[15:35:29] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (BBlack) p:Normal>High Turning priority to "high" for this and the 5001 ticket ( T199675 ), as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.
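[editor's note on the 25% figure at 14:42:01] The arithmetic is that when objects are spread across N backends by hash, pulling one backend moves roughly 1/N of the objects to a new node, so 4 backends gives about 25%. The sketch below illustrates this with rendezvous (highest-random-weight) hashing, which is an assumption chosen for brevity and not necessarily the scheme the cache cluster uses. The node and object names are hypothetical.

```python
import hashlib


def owner(key, nodes):
    """Return the node with the highest hash score for this key (rendezvous hashing)."""
    return max(nodes, key=lambda n: hashlib.sha256(f"{n}|{key}".encode()).digest())


def churn_fraction(n_nodes, n_keys=20000):
    """Fraction of keys whose owner changes when one of n_nodes is pulled."""
    nodes = [f"node-{i}" for i in range(n_nodes)]  # hypothetical backend names
    remaining = nodes[1:]                          # pool with one backend removed
    keys = [f"object-{i}" for i in range(n_keys)]  # hypothetical object names
    moved = sum(1 for k in keys if owner(k, nodes) != owner(k, remaining))
    return moved / n_keys


if __name__ == "__main__":
    for n in (4, 5, 6):
        # Each result should land close to 1/n.
        print(n, round(churn_fraction(n), 3))
```

With this scheme only the keys owned by the pulled node move, which is why the fraction tracks 1/N and why a smaller pool means more churn per restart.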
[15:36:09] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) I put in the self dispatch last week, but have not gotten a reply on it. I'll fall back to simply calling into technical support daily until this gets a resolution.
[15:44:10] Traffic, Operations: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Vgutierrez) p:Triage>Normal
[15:54:55] Traffic, Operations: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (BBlack)
[15:55:29] Traffic, Operations: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (BBlack)
[15:55:32] Traffic, Operations: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853 (BBlack)
[16:05:32] Traffic, Operations: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Vgutierrez) From https://letsencrypt.org/docs/client-options/, another interesting option could be free_tls_certificates library. It's a high-level library based on python3-acme, on an initia...
[16:49:39] Traffic, Operations: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Krenair) Need to ensure that whatever we pick has the ability to be extended in terms of how challenges are done. I.e. we'll want to be able to have http-01 write to files, and dns-01 either...
[17:00:53] Traffic, Operations: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Krenair) >>! In T199717#4428126, @Vgutierrez wrote: > From https://letsencrypt.org/docs/client-options/, another interesting option could be free_tls_certificates library. It's a high-level l...
[17:14:31] bblack: Question for you regarding domain validation and our dns system. Globalsign needs me to verify we own wikimedia.org for me to issue a renewal of managed ssl for *.corp.wikimedia.org
[17:14:37] and they give me a dns txt record to do so
[17:14:52] just ensuring its as simple as putting it into our zone file and pushing update
[17:15:08] since ive not done this since back when we used a different dns system here ;D
[17:16:20] (I assumed you were the person best to ask regarding our dns ;)
[17:16:22] yeah you should be able to merge it with authdns-update as normal
[17:16:49] but in general, I know GlobalSign sent out some notice a while back about re-authing domains in general
[17:16:54] there was some policy change or something
[17:17:16] so, I guess we'll probably have to go through this for all the others well ahead of the next unified cert renewal too
[17:17:24] just something we should keep in mind
[17:17:59] yeah, ill flag you to review my change
[17:18:29] also, I believe GlobalSign supports ACME as well (like LE), so at some point we might be able to semi-automate our GS renewals a bit better too.
[17:19:28] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/446066/
[17:19:55] 1. Create a DNS TXT record for the domain with the validation code CZZ+oNsznq8UgvW4DTAQM7zFm4+USRZs+F68lvn4FfE=.
[17:20:02] the . is not part of code
[17:20:11] (as another line has the code in quotes)
[17:20:33] I assume it's temporary and we can remove it after the validation is done
[17:20:37] correct
[17:20:44] its what we've done in the past at least
[17:20:46] ok
[17:20:57] ive watched someone (i thought it was you) do this for past year validations, heh
[17:21:03] but its only a vague recollection
[17:21:14] (it could have been anyone else just i know it wasnt me i was just lurking the changes)
[17:21:26] thanks!!
[17:22:04] and now ive done it recently so when renewal itme rolls around in a few months, its familiar
[18:10:51] netops, Operations, fundraising-tech-ops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (Jgreen) DNS is done! ;; ANSWER SECTION: frmon.wikimedia.org. 3600 IN CNAME frmon-eqiad.wikimedia.org. frmon-eqiad.wikimedia.org. 3600 IN A 208.80.155.9
[19:32:36] netops, Operations, fundraising-tech-ops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (ayounsi) NAT created: ```lang=diff [edit security nat static rule-set static-nat] rule frbast1001 { ... } + rule frmon1001 { + match { + de...
[19:34:53] Traffic, Operations: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (Varnent) @BBlack - excellent - thank you!!
[20:29:45] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (ayounsi) >>! In T184293#4415745, @mark wrote: > # On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on t...
[20:30:23] also, I believe GlobalSign supports ACME as well (like LE), so at some point we might be able to semi-automate our GS renewals a bit better too.
[20:30:25] interesting
[20:30:44] I guess that explains why we made the central cert service generic in naming
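[editor's note on the TXT validation exchange at 17:19:55-17:20:11] The pitfall raised there is that the vendor's instruction sentence ends with a period that is not part of the validation code. A minimal sketch of guarding against that when preparing the record is below. The helper names, the owner name `example.org.`, the TTL and the placeholder code are all hypothetical, and the output is generic RFC 1035 master-file syntax, which may differ from what the actual zone repository expects.

```python
def code_from_instruction(sentence):
    """Take the last whitespace-separated token and drop a sentence-final period."""
    token = sentence.split()[-1]
    return token[:-1] if token.endswith(".") else token


def txt_record(owner, value, ttl=600):
    """Format one TXT record line; owner and ttl are illustrative assumptions."""
    if '"' in value or "\\" in value:
        raise ValueError("value needs escaping; not handled in this sketch")
    if len(value.encode()) > 255:
        raise ValueError("a single TXT character-string is limited to 255 bytes")
    return f'{owner} {ttl} IN TXT "{value}"'


if __name__ == "__main__":
    # Placeholder instruction text, not a real validation code.
    instruction = "Create a DNS TXT record with the validation code abc123=."
    print(txt_record("example.org.", code_from_instruction(instruction)))
```

Quoting the value, as the second line of the vendor's instructions apparently did, removes the ambiguity about where the code ends.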