[01:59:20] 10Acme-chief, 10Patch-For-Review: Add simple script for account creation - https://phabricator.wikimedia.org/T207372 (10Krenair) well, ideally it would've been a script applicable to all installs of the package, not just in wikimedia puppet.git
[02:34:12] 10Acme-chief, 10Beta-Cluster-Infrastructure: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12 - https://phabricator.wikimedia.org/T271778 (10Krenair)
[02:34:54] 10Acme-chief, 10Beta-Cluster-Infrastructure: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12 - https://phabricator.wikimedia.org/T271778 (10Krenair)
[02:37:08] 10Acme-chief, 10Beta-Cluster-Infrastructure: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12 - https://phabricator.wikimedia.org/T271778 (10Krenair) re acme-chief part: It looks like the same thing happened to the mx and wikibase certs too. Haven't checked those updated on the machines that...
[10:47:59] Amir1: hello! :)
[10:48:17] Amir1: ok to merge the wdqs ats patch now?
[11:27:40] 10Traffic, 10SRE: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema) p:05Medium→03High Lowering the number of Lua states on cp3050 did [[ https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comparison?viewPanel=90&orgId=1&var-site=esams%20prometheus%2Fops&var-...
[11:28:22] 10Traffic, 10Performance-Team, 10SRE: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema)
[12:44:09] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10AlexisJazz)
[12:46:09] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10AlexisJazz)
[12:59:38] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10RhinosF1) Certificates last 3 months so probably similar issues
[13:00:23] vgutierrez: I think krenair asked you about ^ last time
[13:21:11] hmm that's basically ats not being effectively reloaded?
[13:21:17] at least that was the case last time
[13:29:17] vgutierrez: I'd guess so. Someone will need to do so I assume.
[13:29:32] root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -dates -noout -in rsa-2048.crt
[13:29:32] notBefore=Jan 12 01:23:09 2021 GMT
[13:29:32] notAfter=Apr 12 01:23:09 2021 GMT
[13:29:43] sigh
[13:30:04] That's renewed so yeah
[13:30:10] and done
[13:31:07] vgutierrez: I still get cert expired when browsing
[13:31:26] RhinosF1: try again please
[13:31:47] vgutierrez: that's working now
[13:32:00] Guess we should monitor/fix that
[13:32:12] So it reloads itself
[13:32:49] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10Vgutierrez) `root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -dates -noout -in rsa-2048.crt notBefore=Jan...
[13:33:12] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10RhinosF1) 05Open→03Resolved a:03Vgutierrez Fixed per discussion on #wikimedia-traffic at least for another 90 days.
[13:33:45] 10HTTPS, 10Traffic, 10SRE, 10Beta-Cluster-reproducible: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10RhinosF1) But yeah I guess this should be fixed/monitored better so it doesn't need manual reload.
[15:38:51] ema: sorry, I was afk. Yes it is
[15:40:29] Amir1: k, merging!
[15:41:11] Thanks!
[15:41:50] * addshore watches
[15:41:52] Amir1: want me to force a puppet run in esams so you can check that things work as expected?
[15:41:57] yeah
[15:42:02] I was about to ask :D
[15:42:05] This is exciting!
[15:42:22] addshore: exciting would be an understatement for me :P
[15:43:23] running puppet
[15:46:30] Amir1, addshore: done
[15:48:02] it looks fine, the js assets get "hit-front" but adding random things to the url still returns it
[15:48:10] https://usercontent.irccloud-cdn.com/file/zyVlAINX/image.png
[15:49:34] the ui and query work fine
[15:49:40] let me see the graphs
[15:49:56] that's an increase in 405s and 404s
[15:51:32] I see some 405s for POST requests to https://query.wikidata.org/sparql
[15:52:38] shouldn't it go to the wdqs backend?
[15:55:00] any idea what the 404s are for?
[15:55:06] I'm not sure where to see such logs
[15:55:09] addshore: the thing is that wdqs.discovery.wmnet is the backend
[15:55:27] we haven't removed the gui from there yet so requests should not get 404s
[15:55:52] yes, very true
[15:58:30] Was that just interesting timing, or? I'm still a bit confused, everything seems to be fine from what I can see, but those 404s and 405s happened at the same time?
[15:59:23] addshore: I can't use query.wikidata.org
[15:59:23] if it gets higher in the next three minutes (when the rest catch up) then it's not a coincidence
[15:59:40] dcausse: what's happening?
[15:59:45] no clue
[15:59:52] 404?
[16:00:08] looks like I'm getting to the UI instead of sparql results
[16:00:44] dammit, yes, I can reproduce now
[16:00:46] that might be the patch I did. Can you give steps to reproduce?
[16:00:59] Amir1: for me head to the UI and run any query
[16:01:24] https://query.wikidata.org/sparql?query=%23Cats%0ASELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ146.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%2Cen%22.%20%7D%0A%7D
[16:01:30] open query.wikidata.org and run a query, it'll hang forever
[16:01:31] it does look like the /sparql?query handler is just returning the UI page itself
[16:01:32] I swear I did it and it worked
[16:01:40] yeah
[16:01:54] and then that produces a JS error: Uncaught TypeError: Cannot read property 'bindings' of undefined
[16:01:55] that's the patch, let me see how it's possible
[16:01:58] yes
[16:02:41] do you want me to revert the ats-be patch?
[16:02:51] no no
[16:02:53] found the issue
[16:02:56] ack
[16:02:56] I'll fix it now
[16:04:11] I'm confused, if I POST to https://query.wikidata.org/sparql I get 405 Not Allowed which I think is from the wdqs host? but if I GET it I get the UI ...
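The symptom described above — GET /sparql returning the GUI page while POST gets a 405 — is what you'd expect if the remap rule sent /sparql to the backend's root instead of preserving the path. Here is a minimal ts-lua sketch of that failure mode; it is purely hypothetical (the actual fix, linked just below, was a remap-rule change, and this script is not the deployed code):

```lua
-- Hypothetical illustration only: the real fix was a remap.config rule
-- change, not Lua. The ts_lua remap entry point and constants are real
-- plugin API, but this script was never deployed anywhere.
function do_remap()
    -- Broken shape: the path is dropped, the wdqs backend sees "/" and
    -- answers with the GUI index page for every SPARQL query:
    --   ts.client_request.set_uri('/')
    -- Fixed shape: keep the /sparql path on its way to the backend,
    -- mirroring "map .../sparql https://wdqs.discovery.wmnet/sparql":
    ts.client_request.set_uri('/sparql')
    return TS_LUA_REMAP_DID_REMAP
end
```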
[16:04:21] ema: https://gerrit.wikimedia.org/r/655697
[16:04:26] this should fix it
[16:04:33] addshore: that explains why it broke
[16:05:14] +1, yes I see that now, that patch brings it in line with what is done for /bigdata/ldf
[16:06:10] sorry that was my mistake
[16:08:16] Amir1: I've rephrased the commit message to be a bit more specific, merging now
[16:08:20] Sorry I was writing it so fast, couldn't come up with a good commit message
[16:08:26] Thanks
[16:10:43] alright so now we have:
[16:10:45] map http://query.wikidata.org/sparql https://wdqs.discovery.wmnet/sparql
[16:11:01] and likewise for /bigdata
[16:11:06] yup
[16:11:25] excellent, running puppet
[16:15:48] done, please check again!
[16:16:19] Looks good to me!
[16:16:51] the numbers are recovering
[16:17:56] nice
[16:18:38] https://usercontent.irccloud-cdn.com/file/kWCTFzuB/yup%20%3A)
[16:21:19] Amir1: you feel like closing the ticket and sending an internal email then? :)
[16:21:38] sure thing!
[16:38:59] ema: I poked around at some luajit stuff too at some point earlier today. my main suggestion, which you might not have thought of yet: you might try disabling normalize-path on a node and seeing if it behaves differently (assuming we can consistently reload lua and get the same consistent results in general)
[16:39:30] because luajit doesn't compile string.gsub, which was in one of your traces, and I think normalize_path is one of the main places gsub is used
[16:39:54] (and it's possible jumping out to uncompiled code for things like gsub drives the jit malloc calls)
[16:40:08] bblack: I thought about that too, but my current guess is that the opposite is happening: compilation is what causes the high system cpu usage due to mmaps
[16:40:29] or are we using real closures in our lua anywhere?
[16:40:34] the current experiment is having jit.off(true, true) at the beginning of all lua scripts on cp5008
[16:40:44] yeah that's going to be interesting!
[16:40:53] and that effectively stops mmap from being called at all
[16:41:22] I'm trying to think of reasons our lua would end up constantly compiling other than closure-like constructs
[16:41:43] unless patterns for gsub/gmatch are effectively redefining a dynamic chunk of code on the inside (for the match state machine)
[16:42:08] also lua pcall/error (catch/throw) can cause this
[16:42:40] http://wiki.luajit.org/NYI is an interesting reference!
[16:43:38] (which is where I found gsub/gmatch as "not compiled", which I think implies some overhead and possibly this mmap junk just jumping between the compiled JIT code and the interpreter stuff)
[16:44:27] yeah, I was looking at the NYI page earlier last week and praying Dijkstra we don't have to reimplement gsub
[16:46:12] it looks like for our use case the interpreter might be fast enough (< 1ms p75 for backend hits) -- and surely as it is now the compiler is doing more harm than good
[16:47:35] perhaps what's happening is that the compiled code is somehow "thrown away" too early by tslua and we end up identifying hot paths and compiling all the time
[16:49:25] on cp3050 for instance lj_vm_hotcall happens ~1k times per second
[16:49:45] which is roughly the rps served by ats
[16:51:00] and my understanding is that hot calls should be detected much less often than that: I imagine the JIT should find out that a given function is hot, compile it, and run it compiled for some reasonable amount of time
[16:52:45] maybe there's some jit tunable for the cache size for hot code/paths that needs increasing?
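For reference, stock LuaJIT does expose such tunables from Lua through the jit.opt module. The values below are illustrative assumptions, not tested recommendations, and (as the next message notes) tslua.so may not expose them at all:

```lua
-- Stock LuaJIT trace-cache tunables, settable via jit.opt.start().
-- Values here are illustrative assumptions; the LuaJIT defaults are
-- maxtrace=1000, maxmcode=512, hotloop=56, hotexit=10.
require("jit.opt").start(
    "maxtrace=4000",  -- max number of traces kept in the trace cache
    "maxmcode=8192",  -- size of the machine-code area, in KB
    "hotloop=56",     -- loop iterations before a trace is considered hot
    "hotexit=10"      -- taken side exits before a side trace is compiled
)
```

If compiled traces really are being thrown away too early, raising maxtrace/maxmcode is the knob that would keep more of them resident instead of re-detecting the same hot paths every request.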
[16:54:40] the luajit binary for running Lua code on the CLI does expose those, I think tslua.so does not
[16:54:50] look at cpu usage with jit disabled on cp5008:
[16:54:51] https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comparison?viewPanel=90&orgId=1&from=1610468709837&to=1610470340567&var-site=eqsin%20prometheus%2Fops&var-instance=cp5008&var-instance_b=cp5009
[16:55:11] or again it could be closures that cause it
[16:55:18] I found one closure in non-test lua code of ours
[16:55:25] function do_global_read_request()
[16:55:25] if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org' and ts.client_request.get_uri() == '/ats-be' then
[16:55:28] ts.http.intercept(function()
[16:55:31] ts.say('HTTP/1.1 200 OK\r\n' ..
[16:55:42] but surely healthcheck traffic isn't heavy enough to drive all that load compiling closures
[16:57:10] ema: does it go back up if you turn jit back on? (or does cpu always drop off for a while on changes?)
[16:59:14] but yeah if turning off jit results in better latency (even out at high p-values), that's the simplest answer :)
[16:59:15] bblack: nope, CPU never stays down like that for that long after a restart
[17:00:13] ema: fyi two crs for review https://gerrit.wikimedia.org/r/c/operations/puppet/+/651174 https://gerrit.wikimedia.org/r/c/operations/puppet/+/651171 (not urgent)
[17:02:38] 10Traffic, 10Performance-Team, 10SRE: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema) Disabling JIT in all Lua scripts on cp5008 resulted in ats-be not calling lj_vm_hotcall/mmap anymore and CPU usage [[https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comp...
[17:03:01] I've gotta run now, leaving jit (and puppet) disabled on cp5008 till tomorrow
[17:03:40] feel free to re-run puppet and run ats-backend-restart if for some reason things break (I don't think they will)
[17:03:49] ok
[17:04:45] jbond42: tomorrow! :)
[17:04:50] see you folks
[17:05:11] ema: thanks, enjoy your evening :)
[22:03:43] 10Traffic, 10SRE: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (10BBlack) There's some anomalies in network graphs on authdns1001 that I hadn't noticed until today, which go all the way back to Oct 26, which is pro...
[22:08:16] my debian packaging process: start, get stuck on some inane error because ... debian packaging. find solution and fix, but don't document at that time. next time, repeat the same cycle
[22:08:20] *sigh*
[22:51:03] Is anyone around/interested in looking at a routing mystery? https://phabricator.wikimedia.org/T271867
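Putting the afternoon's two threads together — ema's jit.off(true, true) experiment on cp5008 and bblack's suspicion about the per-request closure passed to ts.http.intercept — a minimal sketch of the combined change. The healthcheck response body is truncated in the log above, so the one here is an assumption:

```lua
-- Minimal sketch combining the two ideas discussed above (not the deployed
-- code). jit.off(true, true) disables the JIT for this chunk and,
-- recursively, for everything it defines -- matching the cp5008 experiment.
-- Hoisting the intercept handler to a named top-level function avoids
-- building a fresh closure on every matching request, which is the
-- closure-driven recompilation bblack speculates about.
jit.off(true, true)

-- Assumption: the log truncates the original ts.say() call, so this
-- response body is a guess at its shape.
local function healthcheck_response()
    ts.say('HTTP/1.1 200 OK\r\n' ..
           'Content-Length: 0\r\n' ..
           '\r\n')
end

function do_global_read_request()
    if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org'
            and ts.client_request.get_uri() == '/ats-be' then
        ts.http.intercept(healthcheck_response)
    end
end
```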