[02:42:00] netops, Operations, ops-codfw: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces] - interface-range vlan-cloud-support1-b-codfw { - member ge-8/0/9; - mtu 9192; - unit 0 { - fami...
[02:45:43] netops, Operations, ops-codfw: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) @ayounsi since I deleted the interface range, do you want me to also delete the VLAN cloud-support1-b-codfw?
[11:58:41] netops, Operations, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero)
[12:31:18] netops, Operations, ops-codfw: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (ayounsi) Nope, please keep the vlan.
[13:22:10] Traffic, Operations: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 (Vgutierrez)
[13:22:29] Traffic, Operations: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 (Vgutierrez) p:Triage→Medium
[13:44:11] netops, Operations, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (jbond) p:Triage→Medium
[13:44:53] Traffic, MediaWiki-Debug-Logger, Operations: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (jbond) p:Triage→Low
[13:51:10] netops, Operations: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (ayounsi) Down BGP sessions disabled on our side until the remote side is fixed.
[14:00:29] netops, Operations: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (jbond)
[14:06:29] netops, Operations, Patch-For-Review: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (jbond) Open→Resolved a:jbond I have excluded the files in this directory from puppet management
[14:24:34] netops, Operations, decommission, ops-eqiad: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi) Ping?
[14:41:00] Traffic, netops, Operations: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (Vgutierrez) Open→Resolved a:Vgutierrez
[14:41:05] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Vgutierrez)
[14:56:03] netops, Operations: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (ayounsi) a:faidon In https://github.com/wikimedia/puppet/blob/59ae7b6aa0f8413b4a9a0479089b69b823aca532/modules/nagios_common/files/check_commands/check_bgp#L299 the script ignores the...
[15:08:28] netops, Operations, Patch-For-Review: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (ayounsi) a:faidon→ayounsi From Faidon: > idle IIRC was when the other side shut down their sessions > but feel free to remove that elsif and see what happens
[15:08:35] go bd808
[16:06:35] netops, Operations, Patch-For-Review: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (ayounsi) Open→Resolved Tested and works as expected, will re-open if there are any false positives.
[16:35:27] Ok traffic folks
[16:35:39] I'd like to resume the bios updates via https://phabricator.wikimedia.org/T243167
[16:35:57] i've not done any since the outage, so i have cp1088-cp1090 in eqiad to finish.
[16:36:03] ack
[16:36:11] we've finished the reimage to buster
[16:36:20] so there are no scheduled reboots/depools on our side
[16:36:26] awesome
[16:43:47] Ok, so I know that I asked this before
[16:43:52] but I neglected to copy it down to the task
[16:44:06] but is there something i can run before taking a host offline to check the overall health and how many servers are in a service pool?
[16:44:37] vgutierrez: ^
[16:47:13] bblack: ^
[16:48:39] robh: https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-layer=tls&var-cluster=upload&from=now-6h&to=now
[16:49:26] Oh, that is helpful, thank you =] I was thinking more of a command that tells me X of Y servers are pooled
[16:49:39] but if while i'm working i see this start to spike, i know i messed something up ;D
[16:49:43] other than that
[16:49:48] confctl CLI
[16:49:57] there's https://config-master.wikimedia.org/pybal/eqiad/text-https too :)
[16:50:08] bblack@cumin1001:~$ confctl select 'name=cp1.*' get|sort
[16:50:14] was what I pasted before
[16:50:31] and https://config-master.wikimedia.org/pybal/eqiad/upload-https for upload
[16:51:03] vgutierrez@puppetmaster1001:~$ sudo -i confctl --quiet select 'dc=eqiad,cluster=cache_text,service=ats-tls' get |grep yes |wc -l
[16:51:03] 8
[16:51:03] vgutierrez@puppetmaster1001:~$ sudo -i confctl --quiet select 'dc=eqiad,cluster=cache_text,service=ats-tls' get |grep no |wc -l
[16:51:03] 0
[16:51:05] even that :)
[16:51:13] 8 pooled, 0 depooled
[16:51:24] bblack: thank you, i knew you had told me and i forgot to copy it to notepad
[16:51:29] also thanks to everyone else too
[16:51:40] i just know bblack had answered it last week so i felt bad re-asking ;D
[17:08:39] Traffic, Operations, ops-eqiad: cp1088 - https://phabricator.wikimedia.org/T245645 (RobH)
[17:10:28] Traffic, DC-Ops, Operations, ops-eqiad, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[17:16:24] so i dunno how to read https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-layer=tls&var-cluster=upload&from=now-6h&to=now
[17:16:35] because before i took 1090 offline, it shows a plummet in graphs
[17:16:37] which seems odd to me.
[17:17:05] because it's depooled
[17:17:08] so it's not getting traffic
[17:17:15] but that was before i shut it down
[17:17:31] it plummets like 15 min ago, i shouldn't have powered down if i'd noticed it
[17:17:38] or i would have left it for you folks to look at
[17:17:46] the graph hover data is confusing
[17:18:03] yeah, i didn't get it so i echoed here
[17:18:13] so, if you zoom into this view:
[17:18:15] https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-layer=tls&var-cluster=upload&from=now-1h&to=now&fullscreen&panelId=4
[17:18:32] when you hover on the top line, it highlights as cp1090
[17:18:46] but really that's not what it seems to mean, visually
[17:19:13] if you click on cp1088 at the bottom, you'll see it's the one that actually dropped out there, ~16:53
[17:19:26] ahhh, which is expected
[17:19:35] yeah, the visualization is misleading when they are all aggregated
[17:19:36] and if you click on cp1090 to show just it, you see the more-recent drop
[17:19:41] the way they stack is odd, yeah
[17:19:47] ok, i feel better about it now
[17:20:00] things make sense given the cadence i was working at
[17:20:08] it's because they're in a certain order and stacked, so the last two in the order are 1088 and 1090
[17:20:18] so if 1088 drops off, the 1090 part drops off too, because stacking
[17:20:30] thx for the explanation =]
[17:20:34] it's appreciated
[17:20:38] np!
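The stacking effect explained above (cp1090's line falling when it was really cp1088 that dropped out) can be reproduced with a few lines of Python. This is only a rough sketch: the request rates and the `other-hosts` placeholder are made up for illustration, and the only thing taken from the discussion is that cp1088 and cp1090 are the last two series in the stacking order.

```python
# Sketch of how a stacked graph draws each series' upper edge.
# All numbers and the "other-hosts" placeholder are hypothetical.

# Stacking order, bottom to top; the last two are cp1088 and cp1090.
ORDER = ["other-hosts", "cp1088", "cp1090"]


def stacked_tops(sample, order):
    """Return the y-value of each series' upper edge in a stacked graph.

    Each series is drawn on top of the running sum of the series
    below it, so its upper edge is a cumulative total, not its own value.
    """
    tops = {}
    running = 0
    for host in order:
        running += sample[host]
        tops[host] = running
    return tops


# Hypothetical request rates before and after cp1088 is depooled.
before = {"other-hosts": 100, "cp1088": 100, "cp1090": 100}
after = {"other-hosts": 100, "cp1088": 0, "cp1090": 100}

tops_before = stacked_tops(before, ORDER)
tops_after = stacked_tops(after, ORDER)

# cp1090's own traffic is unchanged, but its drawn line falls from 300 to 200.
print("cp1090 own value:", before["cp1090"], "->", after["cp1090"])
print("cp1090 drawn line:", tops_before["cp1090"], "->", tops_after["cp1090"])
```

Clicking a single host in the legend, as suggested in the chat, removes the series underneath it, which is why the per-host view shows the real drop time.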
[17:20:55] finishing the last 2 in eqiad now, then i'll work on a few procurement things
[17:20:58] and then work on esams
[17:21:06] i figure the later in the day i push esams, the better
[17:21:16] yup!
[17:21:33] so i'll wait until after lunch at minimum
[17:39:19] hrmm, icinga checks are staying purple longer for these last two cp hosts, 1089 and 1090
[17:42:58] you can checkbox the purple ones and tell icinga to "Re-schedule next service check" if you want to speed them up
[17:43:18] some of the checks have long intervals naturally, so they stay borked a little longer by default after a reimage, depending on random timing factors
[17:50:47] yeah, i forced them right after i said that
[17:50:53] all good
[17:51:01] all eqiad cp hosts have been updated to the newest bios release
[17:51:11] so if there are any crashes there, it's important to note it
[17:51:26] Traffic, DC-Ops, Operations, ops-eqiad, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[17:53:26] Traffic, DC-Ops, Operations, ops-eqiad, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH) Please note as of now, all eqiad cp systems have been updated to the latest bios revision. If these hosts experience any further crashes, i...
[17:54:04] Traffic, DC-Ops, Operations, ops-eqiad, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[17:54:28] Traffic, DC-Ops, Operations, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[18:50:58] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) In this topic branch I am also switching monitoring of these services from HTTP to HTTPS: https://gerrit.wikimedia.org/r/q/topic:%22icinga-http-https%22+(status:op...
[21:57:03] netops, Operations, ops-codfw: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) Open→Resolved a:Papaul Complete
[22:36:54] Ok, going to work on some esams cp firmware upgrades
[22:37:28] i noticed the requests start to go down on the graphs
[22:37:34] so figured ok to start =]
[22:55:59] Traffic, DC-Ops, Operations, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[23:09:39] Traffic, Commons, MediaWiki-File-management, Multimedia, and 10 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (Krinkle)
[23:09:51] Traffic, Commons, MediaWiki-File-management, Multimedia, and 10 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (Krinkle)