[08:37:04] testing the 'exp' cache admission policy on cp2027 (text) and cp2028 (upload)
[08:37:39] I've restarted varnish-fe on both and on cp2029/cp2030 to clear the caches and have some other hosts to compare against
[08:37:46] (see SAL)
[08:38:42] currently staring at the cache-hosts-comparison dashboard to see differences :)
[08:38:49] https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comparison?orgId=1&from=now-30m&to=now&var-site=codfw%20prometheus%2Fops&var-instance=cp2027&var-instance_b=cp2028
[08:39:56] to compare apples with apples see cp2027 vs cp2029 (text) and cp2028 vs cp2030 (upload)
[08:40:43] what we care about is hitrate (obviously) but also varnish transient memory usage
[08:41:08] the latter is at the bottom of the dashboard under 'Cache Usage'
[08:44:59] we're doing this, among other reasons, to see if the exp policy helps in the long run to deal with varnish transient memory exhaustion on cache_upload https://phabricator.wikimedia.org/T249809
[08:45:55] (which is important because when that area of memory is full we return 503 errors)
[08:56:21] so far the change seems to be helping:
[08:56:25] https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comparison?panelId=8&fullscreen&orgId=1&from=now-30m&to=now&var-site=codfw%20prometheus%2Fops&var-instance=cp2028&var-instance_b=cp2030
[08:56:47] transient usage is a little higher when using 'exp', but we don't seem to be getting the spikes
[09:04:21] at this point, hitrate is comparable on upload (32.5% with exp and 32.6% with the policy based on static size cutoff) and improved on text (64.1% exp vs 63% static)
[09:05:45] much lower cache usage on cache_upload too (1.9G exp vs 4.4G static)
[09:06:15] very exciting!
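For readers unfamiliar with the two policies being compared above, here is a rough sketch of the difference. The log does not spell out how 'exp' is computed, so this assumes the common size-based form in which an object is admitted to the frontend cache with probability exp(-size / c). The parameter `c_bytes`, the cutoff value, and all function names are hypothetical and are not taken from the actual varnish-fe configuration.

```python
import math
import random


def admit_static(size_bytes, cutoff_bytes):
    """Static size cutoff: admit only objects no larger than the cutoff."""
    return size_bytes <= cutoff_bytes


def admission_probability(size_bytes, c_bytes):
    """Assumed 'exp' form: the probability decays exponentially with size.

    c_bytes is a hypothetical tuning parameter; larger values admit
    larger objects more often.
    """
    return math.exp(-size_bytes / c_bytes)


def admit_exp(size_bytes, c_bytes, rng=random.random):
    """Admit the object if a uniform draw in [0, 1) falls below the probability."""
    return rng() < admission_probability(size_bytes, c_bytes)


if __name__ == "__main__":
    c = 64 * 1024        # hypothetical parameter, in bytes
    cutoff = 256 * 1024  # hypothetical static cutoff, in bytes
    for size in (1024, 64 * 1024, 512 * 1024):
        p = admission_probability(size, c)
        print(f"{size:>7} B  static={admit_static(size, cutoff)}  exp p={p:.4f}")
```

Under this assumption small objects are almost always admitted and large ones only rarely, instead of being rejected outright above a fixed size. That behaviour would be consistent with the lower cache usage reported above, though the log itself does not confirm the mechanism.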
[09:30:19] HTTPS, Traffic, DBA, Operations, and 4 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (Marostegui) For the record: https://gerrit.wikimedia.org/r/#/c/operations/software/tendril/+/594412/ https://gerrit.wikimedia.org/r/#/c/operations/...
[09:31:51] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) @dpifke once deployed, we will need to nuke the existing data for navtiming_responsestart_by_host_seconds on Prometheus. Otherwise it's going to...
[09:34:29] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) Put together a dashboard (with the underlying labels swapped for now): https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1
[11:37:26] Traffic, Operations, serviceops, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) >>! In T251726#6103487, @Vgutierrez wrote: > Currently we're using the LE unified cert on the US DCs (codfw, eqiad and ulsfo). LE certs are va...
[11:45:57] netops, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (ayounsi) Cabling diagram, let me know if something is missing or unclear: {F31803448}
[12:39:57] a few hours into the admission policy experiment, results for cache_upload are pretty good: +3% frontend hitrate using less than 1/3 of the memory. Transient usage is high but looks somewhat more regular than with the static admission policy
[12:41:11] on text hitrate improvement is ~1%, using only very slightly less memory (5.3G vs 5.9G), while transient usage is significantly higher (125M vs 21M at the moment)
[13:32:00] Traffic, Operations, serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (BBlack) >>! In T251726#6108138, @Dzahn wrote: > Even if we still use non-LE certs in some DCs i believe this is ok since we should also have other monitoring for the expira...
[13:49:03] Traffic, Operations, serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) Or we could make a new Icinga check that isn't check_http for a specific service but runs openssl directly on the cert file in the private repo and has a generic nam...
[15:09:35] Traffic, Operations, serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) >>! In T251726#6108138, @Dzahn wrote: > Even if we still use non-LE certs in some DCs i believe this is ok since we should also have other monitoring for the ex...
[17:31:14] Traffic, DNS, Operations, Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (Varnent)
[17:38:04] Traffic, Anti-Harassment, Operations: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:38:25] Traffic, Anti-Harassment, Operations: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:39:23] Traffic, Anti-Harassment, Operations: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:40:06] Traffic, Anti-Harassment, Operations: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:41:55] Traffic, Anti-Harassment, Operations: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[18:04:41] Traffic, Anti-Harassment, Operations, serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (CDanis)
[18:05:17] Traffic, Anti-Harassment, Operations, serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[19:49:05] Traffic, DNS, Operations, Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (colewhite) p:Triage→High
[19:57:00] Traffic, DNS, Operations, Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (colewhite) The requested day and time will determine who is available to assist you with this task. Please let us know the details as soon as you have...
[19:58:38] Traffic, Anti-Harassment, Operations, serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (colewhite) p:Triage→Medium