[05:04:29] that's pretty cool: https://docs.google.com/spreadsheets/d/18ztPX_ysWYqEhJlf2SKQQsTNRbkwoxPSfaC6ScEZAG8/edit#gid=0
[05:04:40] IXP megabit/sec cost & comparison
[05:16:30] this too http://www.dca.fee.unicamp.br/~chesteve/ppt/sbrc15-ptt-slides.pdf
[05:16:48] "Anatomy of Internet eXchange Points (IXP) Ecosystem in Brazil"
[06:23:04] <_joe_> ema, vgutierrez can we merge https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/507740 today?
[06:23:26] <_joe_> I am sorry to be pressing, but it's kinda important
[06:23:46] <_joe_> I can take care of the deployment and backporting and all
[07:27:27] Traffic, Operations, ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (MoritzMuehlenhoff) a:Papaul
[08:09:45] _joe_: certainly!
[08:09:53] <_joe_> <3
[08:10:15] I think vgutierrez wanted to also cherry-pick other patches though so maybe let's wait for him?
[08:11:05] <_joe_> sure
[08:13:00] well
[08:13:06] the k8s support
[08:13:11] but up to joe
[08:13:21] he is the stakeholder for that as well
[08:19:15] <_joe_> let
[08:19:17] <_joe_> 's add it
[08:19:21] <_joe_> it's an addition, so no risk
[08:20:17] ack
[08:22:50] so feel free to merge your CR to master
[08:23:01] an I'll take care of releasing a new version
[08:23:08] during the day
[08:23:17] that works for you _joe_ ?
[08:23:28] <_joe_> yes
[08:23:35] <_joe_> I wanted a +1 on the amended version
[08:23:51] <_joe_> and it came :)
[08:23:52] you have two now I think
[08:24:02] <_joe_> sure sure I was explaining
[08:24:29] <_joe_> oh we have gate and submit on pybal now
[08:24:33] <_joe_> we got fancy!
[08:24:39] <_joe_> I mean it also has tests now
[08:24:49] <_joe_> tsk, I remember changing it in the old days
[08:24:52] "now"
[08:25:06] <_joe_> well 5 years ago there was no test :)
[08:25:33] what do we do again when ssh to mgmt IP fails?
[08:25:48] cp1083.mgmt.esams.wmnet seems broken
[08:25:59] ema: check ipmi page on wikitech
[08:26:08] volans: good morning! Thank you!
[08:26:15] <_joe_> ema:
[08:26:21] <_joe_> 1083.esams?
[08:26:26] <_joe_> that seems wrong
[08:26:30] ipmi is an alias but easier to remember
[08:26:36] rotfl that too
[08:26:40] <_joe_> I suspect a level 8 issue there
[08:26:49] ok I go have a coffee
[08:26:53] bbl
[08:26:56] <_joe_> ahahah
[08:26:58] <_joe_> :*
[08:31:55] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (ema)
[08:32:25] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (ema)
[08:42:59] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (ema) Interestingly, there was a memory usage spike right before the host crashed. {F28951427}
[08:53:34] Traffic, Operations: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (ema) I've ack'ed the warnings in Icinga for the time being.
[09:49:13] Traffic, Operations, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ema) Open→Resolved a:ema All Varnish backends in ulsfo upload replaced with ATS.
[10:49:35] Traffic, Operations, serviceops, PHP 7.2 support, User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (jijiki)
[10:49:47] Traffic, Operations, serviceops, PHP 7.2 support, User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (jijiki) p:Triage→High
[10:50:22] Traffic, Operations, serviceops, PHP 7.2 support, User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (jijiki)
[12:10:41] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (CDanis) >>!
In T222620#5163577, @ema wrote: > Interestingly, there was a memory usage spike right before the host crashed. > > {F28951427} I think that is just a strange monitoring artifact. If you zoom in...
[13:11:42] Traffic, Operations, ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (ema) Open→Resolved IPMI seems to be working remotely: ` $ sudo ipmitool -I lanplus -H "cp2009.mgmt.codfw.wmnet" -U root -E chassis power status Unable to read password from...
[13:20:10] _joe_: can you double-check https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/508569 and the related changes?
[13:20:53] I've cherry-picked the changes related to ProxyFetch tests, cause 1.15 doesn't have tests at all for ProxyFetch
[13:21:04] it seemed interesting :)
[13:23:09] <_joe_> I'll look
[13:23:12] thx
[13:26:28] ooooh vgutierrez I'm interested in that change for Prometheus
[13:26:44] cdanis: context please? :)
[13:27:15] we run two independent replicas of Prometheus in each of codfw/eqiad
[13:27:45] but really, each 'instance' is an apache in front of ~half a dozen different Prometheus daemons serving different data on different paths
[13:28:13] so being able to test that each of those is up in pybal is appealing
[13:31:40] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet...
[14:06:23] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1074.eqiad.wmnet', 'cp2003.codfw.wmnet...
[14:19:23] The following packages have unmet dependencies:
[14:19:23] pybal : Depends: python-treq but it is not installable
[14:19:26] wonderful...
[14:20:53] so I think I'm going to revert the k8s cherry-pick _joe_
[14:21:13] considering that you need your change deployed ASAP
[14:23:34] <_joe_> vgutierrez: I can survive without but yes
[14:23:41] <_joe_> let's package treq with ease
[14:25:03] so..? revert or package treq?
[14:25:38] err.. https://packages.debian.org/source/stable/python-treq
[14:25:40] wtf?
[14:26:10] so it just needs to be backported to jessie
[14:28:48] wait, did we introduce a new dependency?
[14:29:15] yes
[14:29:29] the k8s feature requires python-treq
[14:29:41] some idiot (me) forgot about it
[14:30:43] taking into account that python-treq in stretch is on version 15.x and twisted on jessie is on 14.x I don't know if it's going to be a trivial change
[14:31:56] and again, considering that the k8s support is not indispensable right now, I'll be more comfy reverting that commit from the 1.15 branch
[14:32:01] +1
[14:35:33] +1
[14:43:39] +1
[14:43:43] sigh
[14:43:53] May 07 14:43:13 pybal-test2001 pybal[18346]: factory = client.HTTPClientFactory(url, *args, **kwargs)
[14:43:53] May 07 14:43:13 pybal-test2001 pybal[18346]: exceptions.TypeError: __init__() got an unexpected keyword argument 'reactor'
[14:43:57] first release after 13 months
[14:44:06] yepp
[14:44:13] obviously it's going to be painful
[14:44:15] probably my fault!
[14:44:21] so feel free to curse all you want at me
[14:45:45] or maybe I was too aggressive cherry-picking
[14:45:57] let's skip the blaming and let's just fix it :)
[14:45:57] <_joe_> uhm that looks like my doing tbh
[14:47:44] <_joe_> uhm no
[14:47:55] I don't think so
[14:47:56] https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/508569/2/pybal/monitors/proxyfetch.py
[14:48:34] <_joe_> yes I somehow added reactor=
[14:48:36] <_joe_> sigh
[14:48:58] <_joe_> brainfart sorry
[14:49:23] <_joe_> should I write a patch?
[14:50:09] I got it
[14:50:11] <_joe_> I'm dealing with another fire just now
[14:50:29] I'll patch it manually on pybal-test2001
[14:50:34] <_joe_> yep
[14:50:39] <_joe_> just remove that reactor=
[14:53:24] way better :)
[14:53:41] I think it's partially my fault
[14:53:52] cause I've skipped one reactor related commit by mark
[14:54:02] <_joe_> ohh I see
[14:54:13] <_joe_> I did saw the reactor in the original patch
[14:54:30] <_joe_> it's a bit bad tests didn't catch this
[14:54:34] that's right
[14:54:39] <_joe_> but yeah, we mock too much
[14:54:41] so I'll submit the patch for 1.15
[14:55:52] mayb
[14:55:54] just maybe
[14:55:59] i'll take friday to work on pybal
[14:56:06] it's been like 6 months I think :(
[14:56:46] * vgutierrez is on holidays
[14:56:50] (on Friday)
[14:56:54] i might too
[14:56:57] it's my birthday
[14:56:59] /o\
[14:57:01] can I do something fun on my birthday? :)
[14:57:09] of course, come to CPH with me
[14:57:33] http://mikkeller.dk/event/mbcc-mikkeller-beer-celebration-copenhagen-2019/
[14:57:40] heh
[14:57:43] unfortunately I can't travel atm
[15:00:21] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1072.eqiad.wmnet', 'cp1073.eqiad.wmnet', 'cp1071.eqiad.wmnet'] ` and were **ALL** su...
[15:03:57] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1074.eqiad.wmnet', 'cp2021.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2003.codfw.wmnet']...
[15:03:57] _joe_, jijiki could you provide me with a valid config using the new feature to test it on pybal-test2001?
[15:04:05] with one URL it looks happy right now
[15:04:31] <_joe_> so you want a second url?
[15:04:35] <_joe_> what was the first?
[15:05:05] let me paste the current pybal-test2001 config
[15:05:08] <_joe_> so a good config to test 2 urls would be
[15:05:13] <_joe_> ack
[15:05:14] https://www.irccloud.com/pastebin/I535Lh51/
[15:05:30] <_joe_> add to proxyfetch.url
[15:05:35] <_joe_> http://www.wikipedia.org/wiki/it:Francesco_Totti
[15:05:40] <_joe_> and then
[15:05:49] <_joe_> proxyfetch.check_all = true
[15:06:03] <_joe_> then we can try with a failing url
[15:07:18] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2009.codfw.wmnet'] ` The log can be fo...
[15:09:01] vgutierrez@pybal-test2001:/etc/pybal$ curl http://127.0.0.1:9090/metrics 2>/dev/null|grep -i proxyfetch |grep -i blankpage |wc -l
[15:09:01] 71
[15:09:01] vgutierrez@pybal-test2001:/etc/pybal$ curl http://127.0.0.1:9090/metrics 2>/dev/null|grep -i proxyfetch |grep -i totti |wc -l
[15:09:01] 71
[15:09:24] so pybal considers totti a failure
[15:10:58] he looks ok to me, Coppa del Mondo FIFA del 2006 sounds ok
[15:11:29] so a 302 is considered as a failure
[15:11:44] and is reported like that
[15:11:50] May 07 15:07:15 pybal-test2001 pybal[21052]: [apaches_80] ERROR: Monitoring instance ProxyFetch reports server mw2186.codfw.wmnet (enabled/up/pooled) down: 302 Found to http://it.wikipedia.org/wiki/France
[15:12:02] May 07 15:07:25 pybal-test2001 pybal[21052]: [apaches_80 ProxyFetch] WARN: mw2195.codfw.wmnet (enabled/partially up/pooled): Fetch failed (http://www.wikipedia.org/wiki/it:Francesco_Totti), 0.033 s
[15:12:10] cool
[15:12:45] <_joe_> ok perfect
[15:12:46] let's hit it.wikipedia.org directly
[15:12:47] <_joe_> now
[15:12:49] <_joe_> no no wait
[15:12:52] sure
[15:12:59] <_joe_> add proxyfetch.max_failures = 1
[15:13:09] <_joe_> that should let the server pooled
[15:13:13] ack
[15:13:14] <_joe_> even if one url fails
[15:14:10] <_joe_> vgutierrez> so pybal considers totti a failure
[15:14:27] the reason we used S:BP for the previous check was to avoid looking deeply into any misbehaviors (not that that means this approach is wrong, just providing some context)
[15:14:32] <_joe_> you're aware this is considered blasphemy where I live right?
[15:14:44] <_joe_> bblack: bp is a good compromise
[15:14:59] basically we don't want pybal reacting to some deeper layer of failure it can't control (e.g. DB stuff)
[15:15:01] _joe_: I'm ready to order pineapple pizza while saying that Totti sucks
[15:15:17] only to whether it can barely reach the edge of the service.
S:BP isn't perfect at that either of course
[15:16:43] _joe_: max_failures = 1 works as expected
[15:17:00] <_joe_> vgutierrez: you wouldn't survive the latter
[15:17:08] <_joe_> great!
[15:20:23] _joe_: any other test?
[15:20:37] <_joe_> vgutierrez: let's try with two good urls
[15:20:39] <_joe_> :D
[15:20:41] ack
[15:20:43] <_joe_> and max_failures = 0
[15:20:46] <_joe_> then we're all set
[15:20:57] I guess I'll replace Totti with Iniesta
[15:23:32] <_joe_> Don Andres is the only one that I would accept as a substitute, yes
[15:24:19] so my brain has HSTS
[15:24:34] and I've set https:// instead of http://, and of course is not happy
[15:25:25] vgutierrez@pybal-test2001:/etc/pybal$ curl http://127.0.0.1:9090/metrics 2>/dev/null|grep -i proxyfetch |grep -i totti|tail -1
[15:25:25] pybal_monitor_proxyfetch_request_duration_seconds{host="mw2167.codfw.wmnet",monitor="ProxyFetch",result="successful",service="apaches_80",url="http://en.wikipedia.org/wiki/Francesco_Totti"} 0.26462697982788086
[15:25:25] vgutierrez@pybal-test2001:/etc/pybal$ curl http://127.0.0.1:9090/metrics 2>/dev/null|grep -i proxyfetch |grep -i blankpage|tail -1
[15:25:25] pybal_monitor_proxyfetch_request_duration_seconds{host="mw2224.codfw.wmnet",monitor="ProxyFetch",result="successful",service="apaches_80",url="http://en.wikipedia.org/wiki/Special:BlankPage"} 0.1107170581817627
[15:25:37] two successful URLs
[15:25:50] _joe_: everything looks as expected
[15:26:17] feel free to +1 https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/508596
[15:26:50] <_joe_> great
[15:27:29] BTW, take into account that max_failures=1 and one URL failing it would generate a crazy amount of log without being noticed
[15:28:29] <_joe_> I don't plan to use it
[15:30:26] _joe_: are you going to take care of the puppet side?
[15:30:33] <_joe_> yes
[15:30:41] <_joe_> I also have to modify the apache vhosts
[15:30:47] ack
[15:30:55] I'll release 1.15.6 now
[15:31:00] <_joe_> but this way we can continue ramping up traffic
[15:31:04] <_joe_> <3
[15:31:13] <_joe_> thanks for the help, wikilove
[15:31:43] _joe_: I'll be in Rome in two weeks, I accept beer better than love
[15:31:46] }:)
[15:32:34] <_joe_> italian beer?
[15:32:38] <_joe_> you asked for it, not me
[15:33:13] I already told you, there are some nice italian birrificios
[15:33:21] like birrificio Elav in Bergamo
[15:33:29] I'm pretty sure that we can find something nice in Rome as well
[15:34:04] <_joe_> I know one
[15:34:17] <_joe_> where the owner is a friend
[15:34:23] "Ma che siete venuti a fa’, Via di Benedetta 25, Trastevere (this should be your first stop: it’s Rome’s ultimate beer geek spot) [map]."
[15:34:28] <_joe_> and quite the expert
[15:34:33] <_joe_> mutante: meh
[15:34:41] hehe, i wanted to see your reaction :)
[15:34:45] ahahhaha
[15:34:47] <_joe_> I find them posh and hipster
[15:34:52] <_joe_> a tad too much
[15:34:53] sounds good
[15:34:54] * vgutierrez hdies
[15:34:57] <_joe_> ahahahah
[15:34:57] *hides
[15:35:03] when it said "ultimate geek" i assumed what you said :)
[15:38:51] _joe_: done.. after the puppetization is done we can update pybal on the servers
[15:39:01] _joe_: but I'd say that tomorrow EU morning would be better
[15:39:25] <_joe_> vgutierrez: yeah well I guess first we need to deploy the new version, then modify the config
[15:39:44] <_joe_> and yes, tomorrow morning
[15:39:53] <_joe_> I'm going off in a few
[15:40:03] I like my chances to win t-shirts with the coffee deposit full
[15:49:55] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2009.codfw.wmnet'] ` and were **ALL** successful.
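A minimal sketch of the pooling decision exercised in the ProxyFetch test session above, assuming the behaviour reported in the chat: with proxyfetch.check_all every configured URL is fetched, a 302 counts as a failure, and a server is reported down once the number of failing URLs exceeds proxyfetch.max_failures. This is not PyBal's code; the function name `server_is_up`, the dict-of-status-codes input, and the rule that only HTTP 200 counts as success are assumptions made for illustration.

```python
from typing import Dict


def server_is_up(statuses: Dict[str, int], max_failures: int) -> bool:
    """Hypothetical helper: decide whether one server stays up.

    statuses maps each checked URL to the HTTP status it returned.
    Assumption: anything other than 200 (for example the 302 seen
    in the log) is treated as a failed fetch.
    """
    failed = [url for url, status in statuses.items() if status != 200]
    # The server is only reported down when more URLs fail than allowed.
    return len(failed) <= max_failures


# One good URL and one redirecting URL, as in the first test run.
one_redirect = {
    "http://en.wikipedia.org/wiki/Special:BlankPage": 200,
    "http://www.wikipedia.org/wiki/it:Francesco_Totti": 302,
}
# Two good URLs, as in the final test run.
two_good = {
    "http://en.wikipedia.org/wiki/Special:BlankPage": 200,
    "http://en.wikipedia.org/wiki/Francesco_Totti": 200,
}

print(server_is_up(one_redirect, max_failures=0))
print(server_is_up(one_redirect, max_failures=1))
print(server_is_up(two_good, max_failures=0))
```

As noted in the chat, tolerating a failing URL this way keeps the server pooled but would still log every failed fetch, so a persistently broken URL could go unnoticed in the noise.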
[17:19:50] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (aborrero) After the trimmed interface name, we had to generate a `/etc/network/interface` file like this by hand for the config to survive a reboot: ` auto p175s0f1d1.1105...
[17:23:36] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (Vgutierrez) As discussed on IRC, using vlan-raw-device enp175s0f1d1 should be enough, as recommended in https://wiki.debian.org/NetworkConfiguration#Manual_config
[17:28:30] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (aborrero) >>! In T209707#5165188, @Vgutierrez wrote: > As discussed on IRC, using vlan-raw-device enp175s0f1d1 should be enough, as recommended in https://wiki.debian.org/Ne...
[18:27:42] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (Vgutierrez) so taking a deeper look into https://manpages.debian.org/jessie/vlan/vlan-interfaces.5.en.html: > vlan-raw-device devicename > Indicates the device to create the...
[22:18:20] that's interesting too https://github.com/NLnetLabs/routinator/issues/110