[08:01:34] Traffic, Operations: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 (ema) Open→Resolved a:ema The version of trafficserver-prometheus-exporter running on the cluster supports custom metrics, and we ship our own metric files. Closing!
[08:22:14] Traffic, Operations: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (ema) Open→Invalid >>! In T161360#5701799, @Aklapper wrote: > @ema: I don't understand how a task about an issue which happened 30 months ago and we're unsure if there is still a problem can have...
[08:40:28] Traffic, MediaWiki-API, Operations, observability, and 2 others: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854 (Joe) Open→Resolved a:Joe This has been resolved for some time: https://grafana.wikimedia.org/d/RIA1lzDZk/application-s...
[10:24:28] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) And `[10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%` which already failed: T239041
[10:25:06] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Marostegui) Resolved→Open This host went down again: ` And [10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% `
[10:25:08] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[11:11:42] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Volans) I've updated the mgmt interface's DNS names on Netbox that were still reporting the old names `cloud...
[13:55:49] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (MoritzMuehlenhoff) As mentioned in last week's SRE meeting, let's upgrade the firmware to the latest revision on cp3053?
[14:16:38] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (MoritzMuehlenhoff) @RobH As you offered help in the SRE meeting last Monday, can you upgrade the firmware on cp3053?
[14:52:57] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) Where we're at now: * There are 13x authdns servers participating in `authdns-update`: ** The 3 traditional ones (`authdns1001`, `authdns1002`, `ganeti3...
[14:59:49] ema: o/
[15:00:03] will it be a problem that multiple certs will be valid for stream.wikimedia.org
[15:00:05] ?
[15:00:21] there's the existing eventstreams endpoint
[15:00:33] but also, this new eventgate-logging endpoint, and we will also have an eventgate-analytics endpoint
[15:00:40] was planning on using stream.wm.org for all of them
[15:00:47] do they all need to use the same cert?
[15:32:13] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (RobH) It appears cp3053 is online at this time. Can I issue a depool command and shut it down for the firmware update at any time or is further scheduling needed? The process will take about 5-15 minu...
[15:32:34] ^ just checking on cp3053 for firmware i can just issue depool command and power off/
[15:32:34] ?
[15:41:19] robh: hi! Yes, just run depool on the host and power off
[15:41:28] cool, will do shortly post coffee =]
[15:41:33] ty
[16:10:36] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (RobH) confirmed with @ema that depool via command line and power off is fine, moving on to flashing firmware.
[16:14:10] cool, its now flashing the ilom firmware, once thats done i will flash bios
[16:17:13] robh: nice
[16:18:29] godog: if you're around, I'm trying to understand how (at the low level, some config file or whatever on prometheus1001) prometheus gets the hostnames of machines it's pulling stats from.
[16:18:36] do you have any pointers?
[16:19:02] or is that all pushed into a mysql database or something?
[16:19:38] bblack: most is from puppetdb, see the various 'class_config' calls here: ./modules/profile/manifests/prometheus/ops.pp
[16:20:05] so yeah a database alright but postgresql
[16:20:31] maybe I should back up and avoid the A/B thing
[16:20:49] while digging around in recdns traffic, I noticed we have a surprisingly-high count of internal requests for invalid hostnames resulting in NXDOMAIN
[16:21:10] the prime offender seems to be prometheus boxes looking up e.g. mw1229.eqiad.wmnet as mw1229.wikimedia.org first
[16:21:33] because resolv.conf search order, but on a deeper level because I'm assuming in some config somewhere, it just has the short name rather than the FQDN
[16:22:11] ah, indeed there's hostnames in there
[16:22:51] yeah most if not all comes back from puppetdb as hostnames, which also ends up as labels in metrics
[16:23:39] puppetdb has FQDNs though
[16:25:00] I wonder if it would mess up everything at this point though
[16:25:12] e.g. the tags in grafana would all change to fqdns or whatever
[16:27:33] cp3053 coming back up post firmware update
[16:27:52] ema: I assume you guys wanna check it out before repool?
[16:28:13] maybe https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config might be useful bblack
[16:28:13] robh: I'll take a look yes
[16:28:49] bblack: if we were to use fqdns? yeah
[16:31:05] robh: lgtm!
[16:31:16] cool all yours ill update task
[16:31:29] volans: yeah that sounds promising (set them to fqdns, and have relabel_config strip the domainname from the "instance" label)
[16:31:32] robh: repooling then, thank you very much
[16:31:46] Traffic, Operations, ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (RobH) a:Vgutierrez→ema All ilom and bios updated, irc update to @ema and handing this back to #traffic.
[16:31:48] welcome =] zero issues in update
[16:31:50] went fast
[16:32:26] the replace method seems powerful enough and should be easy to regex-split on dots and use only $1 unless we have also sub-domain stuff
[16:32:26] but we could also just say that the excess dns lookups don't seem to be causing problems and it's all premature optimization
[16:32:49] but it pains me to watch prometheus spamming 2x the dns requests it has to, pointlessly :)
[16:34:00] I bet icinga too is a decent offender
[16:34:36] although it has IPs but I recall the resolv.conf search was there too
[16:36:39] ottomata: hey! They can all use the same certificate, but also just have stream.wm.org in subjectAltName of different certs I think
[16:36:47] we could also fix the search order to better-match the common case
[16:36:58] (put eqiad.wmnet first in eqiad, etc)
[16:40:26] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (hashar)
[16:41:04] !log cp3050: ats-be restart with proxy.config.http.server_session_sharing.pool=global T238494
[16:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:09] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[17:11:14] netops, Operations, Puppet, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (akosiaris)
[17:41:05] ok hm ema doing different certs is easier if that is ok
[17:41:57] ema: it is a little weird though, no? since stream.wm.org on frontend will have one cert to browsers, but stream.wm.org can be used when requesting the backend app over https too? but then a different cert is used
[17:41:57] ?
[17:42:34] i guess yeah, only the request from ats/varnish will be using the backend app's cert?
[17:43:01] correct, only requests from ATS (varnish does not speak https at all)
[18:15:12] oh hm ok
[18:35:00] wouldnt ferm rules only allow https connections from $CACHES though?
[18:47:32] same cache boxes
[19:28:19] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[22:20:45] Traffic, netops, Operations: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (BBlack)
[22:23:48] Traffic, netops, Operations: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (BBlack) pdns-rec-exporter fixups in: https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554155/ + https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-r...
[22:45:12] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns4002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto...
[22:57:11] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[22:57:58] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) https://noc.wikimedia.org has been switched to use https://mwmaint.discovery.wmnet (envoy on mwmaint1002).
[23:31:26] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns4002.wikimedia.org'] ` and were **ALL** successful.
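
Editor's note on the 16:20-16:37 DNS thread: the two ideas raised there (scrape targets by FQDN and strip the domain from the "instance" label; short names costing an extra NXDOMAIN lookup because of the resolv.conf search order) can be illustrated with a rough Python sketch. This is not the actual Prometheus or puppet configuration. The function names, the port number, and the search list `["wikimedia.org", "eqiad.wmnet"]` are assumptions for illustration; the log only says that `wikimedia.org` is tried first. The regex is applied fully anchored, as Prometheus relabel regexes are, and the lookup order follows the glibc resolver's documented `ndots` behavior (default 1).

```python
import re

# Hypothetical pattern: first DNS label, optional domain part, optional ":port".
_INSTANCE_RE = re.compile(r"([^.:]+)(?:\.[^:]+)?(:\d+)?")


def short_instance(address: str) -> str:
    """Strip the domain from 'host.domain[:port]', keeping host and port.

    Mimics a relabel 'replace' action: if the anchored regex does not
    match, the value is left unchanged. Caveat: an IPv4 literal would
    also be truncated to its first octet, so this assumes hostnames only.
    """
    m = _INSTANCE_RE.fullmatch(address)
    if m is None:
        return address
    return m.group(1) + (m.group(2) or "")


def candidate_queries(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Return the names a stub resolver would try, in order.

    Names with fewer than `ndots` dots go through the search list first;
    names with at least `ndots` dots are tried as-is first.
    """
    if name.endswith("."):
        return [name]  # already absolute, search list is not applied
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


if __name__ == "__main__":
    search = ["wikimedia.org", "eqiad.wmnet"]  # assumed search order
    print(candidate_queries("mw1229", search))
    print(candidate_queries("mw1229.eqiad.wmnet", search))
    print(short_instance("mw1229.eqiad.wmnet:9100"))
```

With the assumed search list, the short name is first tried as `mw1229.wikimedia.org.` (the wasted NXDOMAIN query), while the FQDN is tried as-is first. Reordering the search list, as suggested at 16:36:47, would fix the short-name case without touching labels.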