[08:09:38] 10DBA, 10Analytics, 10Growth-Team, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10elukey) Needs to be coordinated between me and @mforns when he is back from vacations. Going to put this task in our Incoming Backlog column to get triaged by my team...
[08:10:02] 10DBA, 10Analytics, 10Growth-Team, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10elukey) p:05Low>03Triage
[08:18:20] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10jcrespo)
[08:48:06] FYI: SREs will soon be doing a "live" test of the switchover process. It should go without any repercussions, but still keep it in mind.
[08:49:11] :)
[08:57:47] in particular this will wipe and warmup caches in codfw, set codfw masters in RO, set eqiad masters in RW and update tendril to have eqiad masters as the start of the tree
[08:58:20] so basically will leave things as they are now
[08:58:23] no?
[08:58:50] yeah it should all be a noop, but the queries will be executed
[08:58:55] sure
[09:34:03] volans: can you get debugging of executed actions?
[09:34:39] jynus: sure, for the check of sync I've updated the paste, not sure if you saw it
[09:34:50] https://phabricator.wikimedia.org/P7519
[09:35:10] if you want to follow along the test, ssh into sarin and tail the logs
[09:35:28] /var/log/spicerack/sre/switchdc/mediawiki.log (INFO) or mediawiki-extended.log (DEBUG)
[09:35:56] or you can attach to our tmux if you want cumin's output too
[09:37:16] if they are kept on file, that is good enough
[09:37:28] I can check it later
[09:37:39] cumin's output I think not, I can save them though
[09:38:40] so what I want is what was executed where?
[09:39:18] jynus: we're running those, so if you want to check, better to check them before running them :)
[09:39:43] for the RW operations
[09:39:48] the RO ofc doesn't matter
[09:48:40] 10DBA, 10Operations, 10Puppet: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10akosiaris) 05Open>03Resolved $::mw_primary is removed from puppet now. Resolving this.
[09:51:31] 10DBA, 10Operations, 10Puppet: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10jcrespo) Done at https://gerrit.wikimedia.org/r/457491
[10:00:56] jynus: meeting?
[10:02:15] yes, was talking to faidon
[10:02:24] sorry :)
[10:27:18] jynus, marostegui: heads up, 03-set-db-readonly is about to be run (codfw -> eqiad)
[10:27:29] ok!
[10:30:04] DONE
[10:35:52] jynus: (I guess you're in a meeting, so not urgent) I looked at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/345346/ and I couldn't understand why this needed an icinga replacement
[10:36:52] I mean, if someone argued that the check shouldn't have a confctl/etcd dependency I could see that and its pros/cons, but I guess that's separate?
[10:56:25] jynus, marostegui: about to run 06-set-db-readwrite (codfw->eqiad)
[10:56:47] volans: ok!
[11:04:49] paravoid: the script, after calling confctl
[11:05:02] can only set the level to warning
[11:05:18] because it cannot change the "page/not page" dynamically
[11:05:28] ah!
[11:05:30] now I get it
[11:05:44] joe just proposed to have a specific "auto" way of doing things
[11:05:55] that could work, but it is a hack on top of icinga
[11:06:00] auto how?
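The 03-set-db-readonly / 06-set-db-readwrite steps mentioned above essentially boil down to toggling the global read_only flag on the section masters in each DC. Below is a minimal sketch of that idea, assuming pymysql and placeholder host names and credentials; it is not the actual spicerack/switchdc implementation.

```python
# Illustrative only: a rough sketch of what a "set-db-readonly" /
# "set-db-readwrite" step conceptually does, NOT the real switchdc code.
# Host names and credentials are placeholders.
import pymysql

CODFW_MASTERS = ["db2048.codfw.wmnet"]   # hypothetical list of section masters
EQIAD_MASTERS = ["db1052.eqiad.wmnet"]   # hypothetical list of section masters


def set_read_only(hosts, value, user="root", password=""):
    """Set (or clear) the global read_only flag on every master in `hosts`."""
    for host in hosts:
        conn = pymysql.connect(host=host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SET GLOBAL read_only = %s", (1 if value else 0,))
                # Verify the flag really changed before moving on.
                cur.execute("SELECT @@global.read_only")
                assert cur.fetchone()[0] == (1 if value else 0)
        finally:
            conn.close()


# 03-set-db-readonly: make the masters of the DC being left read-only first;
# 06-set-db-readwrite: only then make the masters of the target DC writable.
# set_read_only(CODFW_MASTERS, True)
# set_read_only(EQIAD_MASTERS, False)
```

The ordering (read-only first, read-write last) is what keeps the window in which both DCs could accept writes at zero, which is why the two steps are run separately in the test above.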
[11:06:00] not that I want to get rid of it
[11:06:21] the contact will give a page or an email/IRC based on the confctl status
[11:06:38] <_joe_> paravoid: make puppet configure the services to use contactgroup "db-core-$dc"
[11:06:39] but icinga has some gaps in terms of api / dynamic configuration
[11:06:49] indeed
[11:06:49] _joe_: so still puppet :-)
[11:07:02] another way to hack around this would be to set up two different checks
[11:07:03] <_joe_> and swap the contactgroup configs on etcd changes via confd
[11:07:28] paravoid: so my "replace icinga" would be some way to make configuration dynamic
[11:07:29] <_joe_> in general I strongly oppose having checks depend on etcd
[11:07:41] _joe_: but still need icinga reload right?
[11:07:41] <_joe_> unless they're checks that involve etcd directly
[11:07:44] one that's paging and results OK on non-primary DCs regardless, and one that's non-paging and warns/crits
[11:07:44] even if we kept icinga
[11:07:45] <_joe_> volans: yes
[11:08:06] still a hack
[11:08:08] paravoid: note this is only for this case
[11:08:12] there are other conditions
[11:08:22] for example- don't page if a server is depooled
[11:08:25] nod
[11:08:30] (maybe, it is an example)
[11:08:36] so I consider this an architecture issue
[11:08:45] but I don't have good solutions atm
[11:09:01] on the other hand, we have to do something to prevent false positives
[11:09:05] nod
[11:09:29] I would like to do ./depool and not worry about bothering other people
[11:09:35] with maintenance, etc.
[11:09:45] that would be great
[11:09:50] (e.g. schema changes applied fleet-wide)
[11:10:06] some of that, joe has been working hard on
[11:10:24] <_joe_> yes, icinga is hard to reconfigure based on things coming from etcd
[11:10:26] again, hacking around mediawiki/php shortcomings
[11:10:27] so this is not a solution to this at all, but have you thought about changing the text to at least broadcast the status at the time of the page?
[11:10:45] paravoid: I thought about setting it as warning
[11:10:49] that can be done dynamically
[11:10:54] it is the patch I sent you
[11:10:58] <_joe_> but that requires querying etcd from the check
[11:11:00] i.e. db2088 (s7 slave), or db2088 (s7 depooled slave), or db1034 (s1 master)
[11:11:01] yep
[11:11:03] <_joe_> I didn't like it
[11:11:14] <_joe_> conceptually, not the patch
[11:11:18] <_joe_> the patch was fine IIRC
[11:11:24] yeah, the patch was a proof of concept
[11:11:38] I have a better mariadb check in python in the works
[11:11:43] _joe_: what would you prefer?
[11:11:49] but it was a quick perl hack
[11:12:09] paravoid: there is another thing- some of the checks are ok to not be dynamic
[11:12:18] for example, the current status
[11:12:35] we don't care if pages don't get reconfigured in 30 minutes
[11:12:47] <_joe_> paravoid: ideally we write a script for notifications that has the intelligence to apply some rules to decide if to page or not to page
[11:12:55] there is no real need to run puppet after switch
[11:13:16] <_joe_> but doing that in a way that's not horribly hacky requires some thought
[11:13:22] so as a short term solution, I think it is ok
[11:13:43] I think it is not ideal for maintenance host
[11:13:48] as it is way more important
[11:13:53] how about keeping these checks as-is and making them non-pageable, and having a higher-level check that's more intelligent and pages?
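The "smarter notification script" _joe_ describes above would sit between the check result and the alert, and decide per host whether to page or just notify. Below is a minimal sketch of that rule logic only; the confctl output parsing, the function names and the rules themselves are assumptions for illustration, not an existing tool.

```python
# Hypothetical notification-routing logic: page only for pooled hosts in the
# primary DC, downgrade everything else to mail/IRC. Not an existing script;
# the confctl output format assumed here (one JSON object per line) is an
# assumption and would need to be verified against the real CLI.
import json
import subprocess


def get_pooled_state(host):
    """Ask conftool for the host's pooled state ('yes', 'no', 'inactive', ...)."""
    out = subprocess.check_output(
        ["confctl", "select", f"name={host}", "get"], text=True)
    for line in out.splitlines():
        obj = json.loads(line)
        for value in obj.values():
            if isinstance(value, dict) and "pooled" in value:
                return value["pooled"]
    return "unknown"


def should_page(host, primary_dc, host_dc):
    """Apply the two example rules from the discussion above."""
    if host_dc != primary_dc:
        return False                         # non-primary DC: never page
    return get_pooled_state(host) == "yes"   # depooled hosts don't page


# e.g. a check firing on a depooled codfw replica would only mail/IRC:
# should_page("db2088", primary_dc="eqiad", host_dc="codfw")  -> False
```

The point of routing in a wrapper like this, rather than in Icinga's static config, is exactly the one made above: the page/not-page decision can follow confctl state without an Icinga reload.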
[11:13:55] and definitely needs a puppet patch
[11:14:03] paravoid: that is the end goal
[11:14:15] but not ok at the moment
[11:14:38] what's the impediment for something like this right now?
[11:14:39] we need a better loadbalancer
[11:14:47] and fix the SPOF mw bugs
[11:15:02] so a queryable pooling state
[11:15:08] which is what joe has been working on
[11:15:25] perfect is the enemy of the good :P
[11:15:26] and mw reliability fixes
[11:15:37] let's start somewhere!
[11:15:44] it is not perfect- in the ideal world it is the "normal state"
[11:15:48] service monitoring
[11:15:53] not host monitoring :-)
[11:15:59] especially for pages
[11:19:07] can I ask one of you to file a task about all this?
[11:19:16] again, think that the issue is the general idea
[11:19:23] these are just 2 examples
[11:19:26] with a description of where we want to be, and what we can do in the meantime
[11:19:29] sure
[11:19:34] it may even exist already
[11:19:56] the best we had was that mw_primary task and apparently this was misunderstood by at least me
[11:20:24] but maybe we have another task that I forgot about :)
[11:20:42] well, that fixed one thing- manually changing stuff on dc failover
[11:20:48] but there are many things to fix :-)
[11:21:39] this is a tracking task for all things filed for dbs https://phabricator.wikimedia.org/T172492
[11:22:04] paravoid: probably https://phabricator.wikimedia.org/T177782
[11:22:10] is what you are looking for
[11:22:28] which depends on offloading the mediawiki db loadbalancer to etcd
[11:22:59] note we are on -databases
[11:23:23] which means I am not taking on other monitoring issues, about alerting
[11:23:45] but I believe many philosophical and technical issues apply to other parts
[11:23:45] yeah, I can see that eventually, I'm wondering if there are stopgaps we can take before that happens (and before we replace Icinga too -- which we really should)
[11:24:07] that is another thing, are we postponing replacing icinga?
[11:24:20] we have postponed it for more than a year, yeah :)
[11:24:20] I don't know and don't have a take on that- it is scary
[11:24:36] but new tools may also give other ways
[11:24:36] by that I mean we were talking about replacing it... 1½ years ago?
[11:25:04] it's on our roadmap for this FY, but I doubt it'll happen (or at least be completed) e.g. next quarter
[11:25:09] paravoid: I would be happy just by informing of all issues
[11:25:19] as I realize you may not know all the info
[11:25:49] so you have a better picture of current challenges
[11:26:02] (which I think is what you are doing by asking, and I am happy you are)
[11:26:14] yeah, and there are lots of moving parts
[11:26:32] I don't think we've ever discussed the need for dynamically adjusting recipient groups based on etcd state before, for instance
[11:26:52] that's an interesting idea, and I don't think that if we designed an icinga replacement today we would have thought of that :)
[11:27:14] this is a small thing
[11:27:26] but in general, dynamic querying and configuration is hard
[11:27:38] it feels to me like this is a "death by a thousand cuts" thing, am I wrong?
[11:27:48] i.e. that there are tons of "small things" involved here
[11:28:01] well, I am the "stateful guy"
[11:28:03] but in total it's a big endeavour and that's why it hasn't happened
[11:28:16] we have issues other people don't
[11:28:18] :-)
[11:28:51] e.g. data is way more dynamic, and I am pushing for a more cloudy approach to configuration
[11:29:07] cloudy without actual cloud- think dynamic
[11:29:18] I agree with that approach fwiw -- I'd guess everyone here does too
[11:29:20] and puppet and icinga are the opposite of that :-)
[11:29:29] very static tools
[11:29:42] puppet is, icinga... is to some extent
[11:30:15] a lot of what you want to do can be done with icinga, just with hacks like duplicating checks and stuff
[11:30:17] for example, graphite was more dynamic to an extent
[11:30:25] you could just start sending stuff
[11:30:37] which leads ofc to other issues
[11:30:44] and tbh I think we should do some of that now, rather than wait for an eventual replacement
[11:30:55] if anything, it will inform us of the requirements a new system should cover
[11:30:57] paravoid: thinking of global monitoring?
[11:31:10] yeah
[11:31:13] service-level monitoring?
[11:31:33] well, ideally, yeah
[11:31:49] but could start with better-informed host-level monitoring in e.g. additional checks
[11:31:52] offtopic, also, in a way, kubernetes is also a move towards a more dynamic option for deployments and monitoring
[11:32:07] paravoid: both are being worked on :-)
[11:32:25] the only thing that is not in my hand is fixing mediawiki and/or PHP :-)
[11:32:40] but maybe php7 will have some things already fixed for free
[11:32:57] if you go to https://phabricator.wikimedia.org/T172492
[11:33:05] you can see all things that are prepared
[11:33:10] or in work
[11:33:22] in fact, literally now I am working on monitoring db backups
[11:33:23] yeah I saw
[11:33:33] it is part of the goal
[11:33:42] but then dc failover happens, and other stuff :-)
[11:33:59] I hear you, but I'd like to be cautious about interdependencies with stuff like replacing our monitoring system and php7
[11:34:08] let's make do with what we have and iterate, I'd say
[11:34:13] there is no dependency on that
[11:34:39] but there is a dependency on mw stability of dbs and drivers doing sensible things :-)
[11:35:01] remember when you did network maintenance once and servers started to misbehave?
[11:35:13] that is NOT supposed to happen, but it did
[11:35:27] honestly, I don't think the existence of monitoring depends on a system's stability :)
[11:35:43] not the existence, as I said, I am working on that
[11:35:49] but this other bug
[11:36:01] https://phabricator.wikimedia.org/T119626
[11:36:13] this is db #1 bug, like ubuntu used to be
[11:36:26] I know I know
[11:36:37] :-)
[11:36:45] I remember this task
[11:36:48] let's talk more
[11:36:51] at another time
[11:36:54] about monitoring
[11:37:02] and dbs and what things you can suggest
[11:37:02] alright
[11:37:11] and what I would need from monitoring/foundation
[11:37:19] goal planning is coming soon, maybe we can think of doing some of that within the goal framework
[11:37:51] the thing is whatever you see weak and not being worked on, we probably saw it a long time ago :-)
[11:38:21] but either we had no time, or had blockers outside of our team
[13:45:40] 10DBA, 10Toolforge, 10Privacy, 10Security, 10Vuln-Infoleak: Some Labs DB user_properties view fields are sensitive - https://phabricator.wikimedia.org/T150679 (10Bawolff)
[14:40:32] I think I will disable puppet tomorrow and deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449711/ slowly and test on an eqiad and a codfw host manually
[14:40:37] if there are no objections
[14:45:43] looks good to me
[14:45:47] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Papaul) a:05Papaul>03Marostegui Disk replaced
[14:46:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Marostegui) Thanks! I will close this once it has finished correctly
[15:14:00] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (10Marostegui)
[15:31:21] https://jira.mariadb.org/browse/MDEV-13333
[15:31:51] https://jira.mariadb.org/browse/MDEV-16647
[15:32:32] is 10.1.36 out?
[15:32:48] https://jira.mariadb.org/browse/MDEV-13333 -> looks scary
[15:33:07] From a comment: I've just checked, 10.1.36 is on the roadmap with the release date 2018-09-14
[15:33:17] yeah, I am compiling right now
[15:33:31] right on time for the DC failover! \o/
[16:15:15] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Marostegui) 05Open>03Resolved All good! Thank you ``` root@db2053:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0BF0) Port Name: 1I Port Na...
[16:15:28] 10DBA, 10Patch-For-Review: Gather statistics about the backups on a database - https://phabricator.wikimedia.org/T198987 (10jcrespo) This is done, I just need to explain and document what was done.
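The slow rollout mentioned above (disable puppet fleet-wide, merge, then test on one host per DC before letting the rest converge) follows a common pattern. Below is a rough sketch of it, assuming the usual disable-puppet/enable-puppet/run-puppet-agent wrapper scripts exist on the hosts; the host names and reason string are placeholders, and the real thing would be driven through cumin rather than plain ssh.

```python
# Sketch of a "disable everywhere, canary per DC, then re-enable the rest"
# puppet rollout. Hypothetical host lists; not an existing tool.
import subprocess

REASON = "rolling out a change slowly"                         # placeholder
ALL_DB_HOSTS = ["db1100.eqiad.wmnet", "db2055.codfw.wmnet"]    # placeholder fleet
CANARIES = ["db1100.eqiad.wmnet", "db2055.codfw.wmnet"]        # one per DC


def ssh(host, command):
    """Run one command on a remote host and fail loudly on error."""
    subprocess.run(["ssh", host, command], check=True)


def rollout():
    # 1. Freeze puppet on every affected host so the change doesn't apply itself.
    for host in ALL_DB_HOSTS:
        ssh(host, f"disable-puppet '{REASON}'")

    # 2. After merging the change, test it on one canary host per DC.
    for host in CANARIES:
        ssh(host, f"enable-puppet '{REASON}'")
        ssh(host, "run-puppet-agent")
        input(f"Check {host} looks sane, then press Enter to continue...")

    # 3. If the canaries look good, re-enable and run puppet on the rest.
    for host in ALL_DB_HOSTS:
        if host not in CANARIES:
            ssh(host, f"enable-puppet '{REASON}'")
            ssh(host, "run-puppet-agent")
```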
[16:18:05] 10DBA: Create Icinga alerts on backup generation failure - https://phabricator.wikimedia.org/T203969 (10jcrespo) p:05Triage>03Normal
[16:29:57] 10DBA, 10Analytics, 10Growth-Team, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10Nuria) Let's 1) stop purging 2) drop all echo tables on events and events sanitized database 3) start purging again
[16:37:32] jynus, marostegui: detailed output updated in https://phabricator.wikimedia.org/P7519 for the switchdc steps
[16:37:44] both output and logs, the logs also detail the specific command executed
[16:40:57] volans: thanks a lot
[16:41:02] I will definitely check those
[16:41:07] thanks!
[16:41:14] I trust the commands
[16:41:22] I don't 100% trust the host discovery
[16:41:30] thank you a lot
[16:42:32] volans: the good news is that removing gtid will mean everything works all the time
[16:42:40] as it should from the beginning
[16:43:30] yeah!
[16:44:31] volans: not for now, but are there some criteria for adding scripts to spicerack vs standalone recipes?
[16:44:54] I saw some people wanting to do that, but I don't know what the "philosophy" is
[16:45:21] jynus: sure, I'll reply after the meeting
[16:48:20] no need to do it now, we can do it another week
[16:59:34] jynus: so ideally all the cookbooks should be simple enough and the main logic should be in the spicerack library, properly coded and tested. That said I see cases in which we want something more complex on the cookbook side just because it's specific or non-reusable, and that's fine too
[17:00:09] it's not clear at this time if we'll also need "libraries" in the cookbooks repo that are not generic enough to make it into spicerack but still need to be re-used across more than 1 cookbook
[17:00:21] personally I'm leaning towards trying to avoid this 3rd layer at first
[17:00:38] and see where we're going and what challenges we face
[17:00:50] so, nothing yet written in stone
[17:01:43] to make concrete examples, some "modules" I envision in the spicerack library are: icinga, elasticsearch (already WIP), ICMP, puppet, mysql, etc...
[17:03:13] let me know your opinion and ideas :)
[17:06:22] "spicerack library" ?
[17:06:36] my question is, what does it provide?
[17:07:54] our abstraction to do stuff on our infrastructure, so there is a dnsdisc module that manages discovery DNS records and their TTL for example
[17:08:09] there are lower-level modules for confctl and cumin
[17:08:38] what I mean is, I can already use (in fact I am using) the cumin library
[17:09:07] what question should I ask myself about when to create a "recipe" or a "library"
[17:09:17] 10DBA, 10Analytics, 10Growth-Team, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10mforns) > Let's 1) stop purging 2) drop all echo tables on events and events sanitized database 3) start purging again Makes sense. Also MySQL right, or are those alr...
[17:09:21] vs a standalone script or another library
[17:09:27] are there any design goals
[17:09:38] not written in some docs, not yet
[17:09:53] I am not asking for those
[17:10:06] I just genuinely don't understand what the goal is
[17:10:23] not disagreeing, just don't understand it yet
[17:10:59] will wmf-reimage be inserted there?
[17:11:05] yes, totally
[17:11:16] the wmf-reimage-lib will be split into different modules and made available there
[17:11:18] so, what is the criteria
[17:11:25] and the reimage script will be some cookbook
[17:11:31] the criteria, as I see it, is:
[17:11:32] if it is python and is wmf, it goes there?
[17:11:54] yes, I want to understand your vision
[17:12:00] if it's something re-usable it goes into the library, if it's something super specific to a single task it goes into the cookbook
[17:12:03] even if we were not yet there
[17:12:23] if it's something generic enough, it's software on its own that will be made available also through spicerack
[17:12:45] but only for maintenance
[17:12:54] no api/daemon style stuff?
[17:12:55] so conftool and cumin will keep being on their own, but made available in spicerack with a wrapper
[17:12:59] that makes them easy to use
[17:13:12] recurring maintenance?
[17:13:30] I am throwing things to understand if those would be part of it or not
[17:13:37] nothing forbids adding the execution of a cookbook to a crontab or systemd timer
[17:13:52] but so far no daemon was planned
[17:13:53] what about daemon/service
[17:13:54] ok
[17:14:01] so all "one off"
[17:14:02] again, I am not asking
[17:14:08] sure sure
[17:14:10] for that, just really trying to understand
[17:14:12] no prob
[17:14:18] because for me it is like an "all in"
[17:14:26] and that is scary
[17:14:44] definitely not an all-in
[17:14:50] at most a wrapper to all things
[17:14:55] ok
[17:14:57] to make it easy to write stuff
[17:15:00] so with that
[17:15:02] so for example
[17:15:07] I guess that if I do a new service
[17:15:13] it will be on a separate code base
[17:15:22] and I would just put there maintenance/change operations
[17:15:27] to call it
[17:15:33] is that kinda it?
[17:15:40] yes
[17:15:43] let's say the backups
[17:16:03] you might have daemons that manage those backups
[17:16:09] and those will not be in spicerack
[17:16:59] but you might have a module for backups in spicerack that is a wrapper or client for your service
[17:17:13] ok, I think I am starting to get it
[17:17:14] and allows writing cookbooks to query or perform maintenance on the backups
[17:17:36] my recommendation, not now
[17:17:47] but put a vision statement somewhere asap
[17:18:09] or you will start having problems understanding how you want to evolve that
[17:18:12] not you
[17:18:14] other people
[17:18:25] totally agree
[17:18:32] I only learned about this due to the dc switchover
[17:18:42] it's a good suggestion
[17:18:43] and didn't fully understand the goal
[17:19:17] I was planning to do it afterwards, didn't expect so much enthusiasm for starting to use it already
[17:19:29] well, people wanted to add stuff
[17:19:36] so I asked what is this?
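To make the cookbook-vs-library split above concrete, here is a rough sketch of what a thin cookbook could look like. The argument_parser()/run(args, spicerack) entry points reflect my understanding of the cookbook convention being described; the spicerack.mysql() accessor and its methods are placeholders, not the real module API.

```python
"""Set a section's masters read-only or read-write (illustrative cookbook)."""
# Sketch only: the cookbook stays thin glue and all the real logic lives in a
# (hypothetical) spicerack module. Method names below are placeholders.
import argparse


def argument_parser():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("dc", help="datacenter to act on, e.g. eqiad or codfw")
    parser.add_argument("--read-only", action="store_true",
                        help="set masters read-only instead of read-write")
    return parser


def run(args, spicerack):
    """Delegate everything to the library; keep the cookbook re-usable logic free."""
    mysql = spicerack.mysql()                    # placeholder accessor
    masters = mysql.get_core_masters(args.dc)    # placeholder method
    masters.set_read_only(args.read_only)        # placeholder method
    return 0
```

Following the criteria above, get_core_masters() and set_read_only() would belong in the library (re-usable across switchover, failover and maintenance cookbooks), while the argument handling and the specific orchestration stay in the cookbook.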
[17:20:00] I was expecting to have to publicize it a bit and do some "selling", not the opposite :D
[17:20:25] it's even too early IMHO to start adding too much stuff, need to polish some things and finalize the client interface
[17:20:42] well, the mysql module is worthless
[17:20:46] as it is now
[17:20:53] or maybe it is the dcswitch>mysql
[17:20:56] it has a big TODO on top
[17:21:09] I have been working on my own api for that
[17:21:12] you know that
[17:21:20] and may be stable after the second rewrite
[17:21:33] and has tests and all
[17:21:40] surprisingly
[17:21:59] 10DBA, 10Analytics, 10Growth-Team, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10Nuria) Looks like all tables in the mySQL db also need to be deleted.
[17:22:03] it hurts to see the cumin mysql TRIPLE QUOTE
[17:24:48] I think I mentioned in all the CRs and the code, the mysql module is a port of the old switchdc one due to time constraints, it should be replaced by a native mysql-speaking client and possibly do stuff either in parallel or async, if possible using/integrating what you've been working on
[17:25:13] yes I know
[17:25:41] note what I am doing is not glue for general mysql querying, but only for admin/root stuff
[17:26:17] and that's what we need in spicerack mostly :D
[17:26:22] root/admin stuff
[17:26:34] well, you should have started there
[17:26:43] and in particular I had modules for the stuff you mentioned, which I needed to do for master failover
[18:15:43] 10DBA, 10AbuseFilter, 10PostgreSQL: Joins on INTEGER and TEXT fail with PostgreSQL - https://phabricator.wikimedia.org/T42757 (10Daimona) Right now we have a total of [[https://phabricator.wikimedia.org/diffusion/EABF/browse/master/?grep=af_id+*%3D+*afl_filter|five cases]] where such a JOIN is done. Schema c...
[19:12:14] 10DBA, 10AbuseFilter, 10PostgreSQL: Joins on INTEGER and TEXT fail with PostgreSQL - https://phabricator.wikimedia.org/T42757 (10Daimona) @Marostegui Unfortunately, neither do I :-) However, PG is just the cause here. What we want to do (and I'm asking DBAs about) is perform a schema change on abuse_filter_a...
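For context on the last two task updates: the join fails because PostgreSQL, unlike MySQL/MariaDB, will not implicitly compare an integer column with a text one. A small illustration with simplified stand-in tables (not the real AbuseFilter af_id/afl_filter schema), assuming psycopg2 and a reachable PostgreSQL server:

```python
# Demonstrates the INTEGER/TEXT join mismatch behind T42757. Table and column
# names are stand-ins; the DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE filters (f_id integer)")
    cur.execute("CREATE TEMP TABLE filter_log (fl_filter text)")

    try:
        # Fails on PostgreSQL: "operator does not exist: integer = text".
        cur.execute("SELECT * FROM filters JOIN filter_log ON f_id = fl_filter")
    except psycopg2.ProgrammingError as exc:
        print("implicit join failed:", exc)

    # Works with an explicit cast; the longer-term fix discussed in the task
    # is a schema change so both sides have the same type.
    cur.execute(
        "SELECT * FROM filters JOIN filter_log ON f_id = fl_filter::integer")
    print("explicit cast join rows:", cur.fetchall())

conn.close()
```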