[00:08:07] https://blog.cloudflare.com/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/?ref
[07:47:29] Traffic, Operations, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (ema)
[08:24:44] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp5008.eqsin.wmnet', 'cp5002.eqsin.wmnet'] ``` The log can be found in `/var/log/w...
[08:41:25] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Nemo_bis) Thank you all for the investigation. The amount of indexed URLs seems w...
[09:02:38] robh: no problem, cp5001 is currently depooled so it won't be an issue
[09:12:45] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5002.eqsin.wmnet', 'cp5008.eqsin.wmnet'] ``` and were **ALL** successful.
[11:39:35] bblack: Hey, I'm picking up https://phabricator.wikimedia.org/T160692 which means ORES uses pool counter to limit the number of connections coming to ORES from certain IPs (basically to improve fighting DoS), but I have two questions: Is it fine to add such load to pool counter? You probably have a good idea of how big the volume of external requests coming to ORES is. Also, I want to have a set of whitelisted internal IPs
[11:39:56] basically any internal IP should bypass the check, do you know the best way to do it?
[12:32:45] there is definitely something weird going on with our TLS stats... they're still reporting AES128-SHA usage but I'm unable to catch a single request on the cp nodes using that ciphersuite (as expected)
[12:51:40] yeah at some layer things are being over-smoothed and then stretched
[12:51:47] I bet it's a week before we see the effect
[12:51:55] (in stats)
[12:53:14] Amir1: if your whitelist is configurable, we could template in the set of our networks from puppet, we use that same list for a lot of similar things
[12:53:49] Amir1: the other part, analytics tools would know best, we can take a peek at them
[12:59:47] hmm the only way on a vtc to ensure that a response comes from a synth call is using logexpect, right?
[13:01:08] bblack: That would be great, thank you!
[13:04:50] Amir1: in rough terms over the past week, if I'm interpreting Turnilo correctly, the averages for ores are ~128 reqs/sec, with roughly 23% of those coming from internal IPs
[13:05:00] does that sound sane?
[13:05:14] yup
[13:05:18] Thank you!
[13:05:52] vgutierrez: logexpect seems like the cleanest way, yes
[13:06:08] bblack: puppet is still disabled on cp-misc, does it need to be?
[13:06:49] no, leftover from me fixing up filesystems on the stretches I think, but let me double-check I actually finished what I started
[13:07:41] yes it appears I finished up everything except for puppet re-enable
[13:08:00] (I pushed the mke2fs changes back to stretch as the decider, and then fixed up all the existing stretch nodes)
[13:08:06] nice
[13:08:13] looks like 500[28] used it fine
[13:12:24] ema: yey, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450020/ seems to work as expected :D
[13:12:57] see you guys at the meeting.. it's time to get my back fixed a little bit /o\
[13:19:31] interesting to see the temperature changes on LVSs after switching traffic
[13:19:35] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=lvs3002&var-datasource=esams%20prometheus%2Fops&from=now-1h&to=now-1m
[13:19:39] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?var-server=lvs3004&var-datasource=esams%20prometheus%2Fops&orgId=1&from=now-1h&to=now-1m
[13:22:18] not that much in CPU.. but the temp. difference is huge
[13:22:38] yeah
[13:23:04] we don't track frequency scaling averages though
[13:23:20] when a server has little to do, cpu speed drops near minimum mhz and heat output and load=X
[13:23:38] when a server gains more to do, cpu speed increases closer to maximum mhz and heat output, gets more done per cycle, and load still = X
[13:24:34] (and of course describing that just in terms of "cpu speed" is naive. the mhz is just one factor, also how often various process C-state sleeps are used, and sometimes the power management avoids certain cores entirely vs waking them all up lightly, etc)
[13:25:06] it's hard to even have a good conception of cpu load variance at all until you get up into the thermally-limited territory
[13:29:42] oh yeah I wasn't thinking of frequency scaling
[13:31:44] the marketing speed of these CPUs is 2.5Ghz. lvs3002 currently claims running @ 1199.951Mhz and lvs3004 @ 2799.987Mhz
[14:21:39] ema: btw I downtimed all the cache clusters' ipsec just now, for the cp1075-99 stuff, for the next ~26h
[14:21:51] so you don't have to mess with them for the reinstalls for now
[14:22:07] (unless you feel like it!)
[14:39:23] I don't! :)
[15:57:59] netops, Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (Papaul)
[16:37:20] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) Ok I think I have finally get something :) So I left tcpdump to capture ipv6 traffic excluding some "known" IPs like puppetmas...
[16:42:26] Wikimedia-Apache-configuration, Operations, WMF-Communications, wikimediafoundation.org: Update redirect for jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (Aklapper)
[16:50:50] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) Tried to find all the occurrences of webproxy and added the related https configuration, let's see if things will change!
[16:54:23] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (Papaul) Port information labtestnet2002 rack B1 ge-1/0/16 labtestnet2003 rack B1 ge1/0/17 ''' [edit interfaces interface-range cloud-instance-po...
[17:02:25] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (Papaul) a:Papaul>Andrew @Andrew both ports are up and in the cloud-instance-ports interfaces ranges. Please check if everything looks good, y...
[17:10:26] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi)
[17:22:17] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1076.eqiad.wmnet'] ``` The log can be found in `/var/log/w...
[18:06:35] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Bouncing the network ports of elastic1049 and elastic1038, solved the issue.
[18:14:11] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1076.eqiad.wmnet'] ``` and were **ALL** successful.
[18:56:26] Traffic, Varnish, Wikimedia-Apache-configuration, Operations: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629 (matmarex)
[19:17:57] Traffic, DNS, Operations, ops-eqiad, Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (Cmjohnson)
[19:18:48] Traffic, DNS, Operations, ops-eqiad, Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (Cmjohnson) a:Cmjohnson>RobH This is ready for install, assigning to @robh for help.
[19:47:40] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1077.eqiad.wmnet', 'cp1078.eqiad.wmnet', 'cp1079.eqiad.wmn...
[19:56:29] Traffic, DNS, Operations, ops-eqiad, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (Cmjohnson)
[19:56:57] Traffic, DNS, Operations, ops-eqiad, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (Cmjohnson) a:Cmjohnson>RobH assigning to @robh to help complete the install
[20:37:59] netops, Operations: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (ayounsi) p:Triage>Normal
[20:38:47] netops, Operations, ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (ayounsi)
[20:48:58] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1080.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1080.eqiad.wmnet'] ```
[20:51:08] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1083.eqiad.wmnet', 'cp1084.eqiad.wmnet', 'cp1085.eqiad.wmn...
[21:22:39] bblack: https://librenms.wikimedia.org/device/device=162/tab=port/port=16590/
[21:22:50] got an alert about the spike of interfaces errors
[21:23:43] you're working on it or the SFP is faulty?
[21:28:52] XioNoX: I've been trying to install it and figure out why it won't PXE
[21:28:55] guess I know now :)
[21:29:51] hello new DAC
[21:31:45] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack) a:BBlack>Cmjohnson Most of these are installed now, but 2x have initial hardware issues: * cp1080 - Reports uncorrectably-bad DIMM in slot A5 on bootup...
[22:10:11] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) @Krinkle Good answer, both in terms of the information that I now have,...
[22:33:51] Traffic, DNS, Operations, ops-eqiad: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (RobH)
[23:17:24] Traffic, DNS, Operations: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (RobH) a:RobH>None
[23:18:17] Traffic, DNS, Operations: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (RobH) So this is ready for someone in #traffic to take over, and migrate authoritative dns services from radon.wikimedia.org. Then we can decom old system radon.
[23:34:03] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Krinkle) For crons of this kind, we tend to use `foreachwiki`, or `mwscriptwikise...