[00:01:17] for some reason i have a hard time uploading to gerrit, but doing that
[00:02:50] atgomez: I made a patch for including a note about the image authorship
[00:04:09] getting the CSS to display similar to before was tricky :P
[00:18:17] done. i merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473306/2/modules/profile/templates/bienvenida/apache-bienvenida.wikimedia.org.erb and ran puppet on both backends
[00:18:25] that should be 1 hour cache now
[00:18:42] we already had mod_headers loaded
[00:37:17] Platonides: we'll need prateek to review all changes
[00:37:28] thank you, let's follow up on the phab task
[10:24:40] please welcome fifo-log-demux! https://gerrit.wikimedia.org/r/#/c/operations/software/fifo-log-demux/+/473432/
[10:26:59] <_joe_> ema: I wonder who pointed you to that project :P
[10:28:04] _joe_: it wasn't google! :)
[10:28:42] 🤔
[10:29:56] ema: no direct imports from the HEAD of a branch on github... I'm disappointed :-P
[10:30:07] <_joe_> ema: I have some questions as maybe my go is bad
[10:33:39] _joe_: ?
[10:33:55] <_joe_> ema: writing on the CR :P
[10:33:58] ah! :)
[10:46:27] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (aborrero)
[10:52:17] <_joe_> ema: add logging and prometheus metrics :D
[11:01:09] nice...
Notice: /Stage[main]/Certcentral::Server/Exec[/usr/local/bin/certcentral-certs-sync]/returns: executed successfully
[11:02:47] cool
[11:08:03] Certcentral, Patch-For-Review: switch certcentral servers from active/active to active/passive - https://phabricator.wikimedia.org/T209161 (Vgutierrez) Open>Resolved a:Vgutierrez
[11:41:49] Traffic, Operations, decommission, ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (MoritzMuehlenhoff) Icinga is flagging broken memory on 1053, simply leaving a note here as that host is up for decom anyway.
[11:49:43] Certcentral: store non-config files in /var/lib/certcentral - https://phabricator.wikimedia.org/T209475 (Vgutierrez)
[14:14:30] _joe_: https://i2.kym-cdn.com/photos/images/original/000/418/257/dfe.jpg
[14:16:50] <_joe_> ema: re: scalability, I think it's ok given your tests
[14:17:45] <_joe_> I thought about it and we could do without locks but the code would be much more complex
[14:27:27] so now are we using go? <3
[14:53:09] Traffic, Analytics, Analytics-Kanban, Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (mforns) Thank you @elukey!
[15:00:11] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (MasinAlDujailiWMDE) Did someone ask for a zone file? I have a zone file! Here, take a zone file!
;-) {F27219415}
[15:03:24] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (Addshore) stalled>Open
[15:04:24] good question, re: go :)
[15:05:10] we do have some formal idea that python is our language of choice for SRE, all other constraints being equal
[15:05:43] (just for all the good reasons that a proliferation of N languages to CI and code-review and learn makes life hard)
[15:06:19] but also part of the rationale at the time, I think, was to have a stick to beat bash scripts with and minimize their usage (which is probably the worst choice for anything complicated)
[15:06:58] we make many exceptions, where e.g. a certain language has better library support for the task at hand or whatever
[15:07:42] (cf icinga and perl scripts)
[15:08:21] ditto the ruby bolted on top of puppet
[15:08:23] I think go is interesting and shiny, but I think if that's our only justification for using it, that line of reasoning could lead to an explosion of interesting languages deployed
[15:08:27] as FYI in T196066 we (analytics) are working on a varnishkafka prometheus exporter to finally ditch logster/graphite/etc..
in python3, let us know if this is not good :)
[15:08:27] T196066: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066
[15:09:19] (of course we'll involve you guys in code reviews as early as possible)
[15:11:02] and then in the totally opposite direction of the above argument (yet another non-python solution, but in an even worse language for maintainability):
[15:11:07] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Addshore)
[15:12:43] things like "a simple socket i/o multiplexer" seem like one of those problem domains where, in my past experience, using anything but C causes more pain than gain. Broadly speaking, most of the higher-level wrappers various $langs provide around the sys/libc calls necessary for that get some detail or other wrong for deep use-cases (e.g. ignore some important errno or whatever), and you'll eventually run into difficult debugging.
[15:13:14] but I don't think that's a strong argument for switching to C from the start. Maybe later if/when annoying bugs happen and the abstraction proves to be problematic!
[15:16:34] relatedly musing: I wonder if go (or any other abstractions) let you use splice(2), which lets you copy data from one socket to another directly in the kernel without the kernel->userspace->kernel inefficiencies!
:)
[15:40:39] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) @Vgutierrez Recabling done as you requested on both servers
[15:54:57] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can...
[16:02:05] Traffic, Operations, ops-codfw, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[16:57:32] <_joe_> bblack: FWIW - several of us know (at least some) go, and it is a language we need to be accustomed to, because kubernetes/docker/etcd
[16:59:47] <_joe_> on a different note; ema, bblack we need to start working on https://phabricator.wikimedia.org/T206339
[16:59:59] <_joe_> everything else is in place at this point
[17:00:51] ok
[17:01:25] the straw pseudo-code still seems like a legit approach to me, but in practice there's some rough edges to it. chiefly, that cross my mind immediately:
[17:02:12] 1) Integrating it properly with the Cookie vs X-Orig-Cookie or whatever we do for general cookie-hacking on cache_text (it may be we need to look at X-Orig-Cookie, depending on where this is all injected at)
[17:02:49] <_joe_> bblack: If I need to look at X-Orig-Cookie on the appservers, it's ok btw
[17:03:19] 2) We probably want to handle the Vary modification with more grace, as the approach there will effectively invalidate all cache contents on deploy. Probably the right answer is to only use the special header (x-use-engine) for the minority (php7) case.
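Point 2) above can be illustrated with a minimal sketch in Python rather than VCL. It assumes a hypothetical opt-in cookie named `PHP_ENGINE` (the log does not name the cookie) and models the cache as varying only on the `X-Use-Engine` request header. The helper names `engine_header` and `cache_key` are made up for illustration and are not part of the actual VCL. The idea shown is that only the php7 minority gets the special header, so ordinary requests keep mapping to the same cache variant as before.

```python
from http.cookies import SimpleCookie

ENGINE_COOKIE = "PHP_ENGINE"    # hypothetical cookie name, not from the log
ENGINE_HEADER = "X-Use-Engine"  # header name mentioned in the discussion


def engine_header(cookie_header):
    """Return the extra request headers to set for this request.

    Only the minority (php7) case gets the special header; every other
    request is left untouched so it keeps using the default variant.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(ENGINE_COOKIE)
    if morsel is not None and morsel.value == "php7":
        return {ENGINE_HEADER: "php7"}
    return {}


def cache_key(url, headers):
    """Toy model of a cache that varies on X-Use-Engine only.

    A request without the header gets an empty variant component,
    which stands in for the default (non-php7) variant.
    """
    return (url, headers.get(ENGINE_HEADER, ""))


if __name__ == "__main__":
    default = cache_key("/wiki/Foo", engine_header("session=abc"))
    php7 = cache_key("/wiki/Foo", engine_header("session=abc; PHP_ENGINE=php7"))
    print(default, php7)
```

In this toy model, requests with no cookie and requests with unrelated cookies share one variant, while opted-in requests get a separate one. How Varnish treats objects cached before the Vary change is not modelled here.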
[17:03:41] _joe_: that all gets fixed back up transparently before it hits appservers, just VCL-internal concerns
[17:03:50] <_joe_> ack
[17:04:40] <_joe_> bblack: any idea when you or ema could work on this? I'm happy to help but I'm a couple years of evolution removed from our vcl code
[17:04:59] <_joe_> so I'd really prefer if one of you would drive the work
[17:05:05] yeah
[17:05:32] what's a reasonable deadline (not omg emergency deadline, but more like a deadline that makes things easy on your end)?
[17:06:30] <_joe_> the FR freeze, I guess
[17:06:42] <_joe_> which happens... I dunno when
[17:06:45] lol
[17:06:49] <_joe_> :P
[17:06:51] <_joe_> let me check
[17:06:56] yeah I've asked that question too, I got vague answers
[17:07:03] <_joe_> oh ok
[17:07:11] <_joe_> then let's assume dec 8th?
[17:07:16] ok
[17:07:19] <_joe_> or something in that ballpark
[17:07:27] <_joe_> if this is not too tight ofc
[17:07:36] <_joe_> in theory we have time until dec 31st
[17:07:37] I think we can make that work
[17:08:48] Traffic, Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (BBlack) From IRC for posterity: ` 17:01 < bblack> the straw pseudo-code still seems like a legit approach to me, but in practice there's some rough edges to it. chiefly, that cross my mind imme...
[17:09:38] <_joe_> great, thanks!
[17:09:54] <_joe_> I might be done with the boring parts of this goal sooner than I expected
[17:10:03] <_joe_> then it's going to be benchmark time :)
[17:11:37] benchmarks are easy to fake anyways, make them awesome :)
[17:12:45] QCI slides from the future: "PHP7 pageviews frobulate* 7534% faster!"
[17:12:50] <_joe_> well it's mostly going to be trial-and-error on how to tune php-fpm
[17:13:03] (fine print in speaker notes: "frobulate means fetch the current unix timestamp")
[17:13:08] <_joe_> I think we're in the 2-3% faster :)
[17:13:17] <_joe_> *ballpark
[17:19:23] Traffic, Operations, Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (BBlack)
[17:20:34] Traffic, Operations, Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (BBlack) Open>Resolved
[17:35:09] Traffic, Operations: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (BBlack) p:Triage>Normal
[17:37:48] Traffic, Operations: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (BBlack)
[17:48:49] Traffic, Operations: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (BBlack) Also, we should pre-downtime the unified ssl checks in icinga early next week before the US Thanksgiving holidays, so that nobody's pestered by a spam of WARNING alerts, which I believe are set to tri...
[20:06:57] bblack: can we do the next webp threshold lowering tomorrow?
[21:21:26] Traffic, Operations, Performance-Team: Stop oversampling Asian countries - https://phabricator.wikimedia.org/T204365 (Imarlier) Open>Resolved Resolved a long time ago, just forgot to close out the ticket.
[21:54:50] gilles: yes, I think so!
[21:55:39] ok then, ping me when you're available tomorrow