[08:15:24] I have written my part for the Monday document on our etherpad, in the usual == Databases == section
[08:55:59] DBA, WMDE-Analytics-Engineering, Wikidata, Wikidata.org, Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (jcrespo) We already have sizes of all uncompressed and compressed tables on zarcillo, those are planned to be shown in a dashboard....
[09:14:57] DBA, WMDE-Analytics-Engineering, Wikidata, Wikidata.org, Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (ArielGlenn) @jcrespo I want specifically slot count by type, and revision count, for commons. Eventually I will want the same for o...
[09:25:21] I am going to deploy the secret in preparation for the new prometheus thing
[09:25:28] ok
[09:30:42] done
[09:32:45] I would like to have you around when I deploy for real
[09:32:54] in case something breaks
[09:33:28] sure
[09:33:32] you want to do it now?
[09:33:44] it will take me a few minutes, but yeah
[09:33:48] sure
[09:33:49] to prepare everything
[09:33:52] cool
[09:36:25] https://gerrit.wikimedia.org/r/c/labs/private/+/521839
[09:50:49] so this is the plan: disable puppet on 3 out of 4 prometheus servers
[09:51:00] copy the old configuration locally
[09:51:05] on one server
[09:51:09] enable puppet there
[09:51:20] let puppet run, see what happens, compare with the old config
[09:51:26] sounds sane yeah
[09:51:27] run puppet several times
[09:51:40] revert and recover config if something is strange
[09:51:41] it doesn't matter if you pick eqiad or codfw really, no?
[09:51:49] maybe also test with zarcillo/account down
[09:52:26] sure
[09:52:34] account handling is not puppetized
[09:52:39] but that needs more conversation
[09:52:57] but it is merged
[09:53:02] good
[09:53:03] tested the script from cumin
[09:53:30] doing on ops
[09:53:35] ok
[09:53:36] any preference for a server?
[09:53:50] not really :)
[09:53:51] maybe codfw
[09:54:10] let's say prometheus2003 as the one to run first
[09:54:13] cool
[09:59:13] I have deployed
[09:59:20] yep I see
[09:59:22] logging in on p2003
[10:00:04] so time to manually run puppet?
[10:00:28] I am copying all the files to my home
[10:00:30] first
[10:00:38] cool
[10:00:55] in case we delete them (not really that useful, as they are generated by puppet)
[10:01:03] but we will be able to diff later
[10:01:47] and now running puppet and we will see
[10:01:57] I am tailing the puppet log :)
[10:02:34] there is always something going bad: grants, typos or something
[10:03:12] oops, it failed?
[10:03:16] Error 500 on SERVER: Server Error: no parameter named 'requires' at
[10:03:20] lol
[10:03:22] yeah, saw the logs
[10:03:38] I don't know puppet anymore
[10:03:52] quick patch
[10:04:04] not worth reverting and reapplying
[10:04:07] for that
[10:04:08] no
[10:06:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/521845
[10:07:57] what is the CI error?
[10:08:16] 10:06:42 modules/profile/manifests/prometheus/ops.pp:1239 ERROR two-space soft tabs not used (2sp_soft_tabs)
[10:08:22] thanks
[10:13:21] CI looking good now
[10:13:55] we have an issue, which is
[10:14:06] the module is being used on non-prometheus role hosts
[10:14:43] which hosts?
[10:15:30] bastions on non-db dcs
[10:15:40] ah..
[10:16:54] same error?
[10:16:56] on puppet?
[10:17:07] ah, it just ran because it was enabled, nevermind
[10:18:30] so I cannot rebase nor submit, I need to rebase locally
[10:18:33] all great
[10:19:20] maybe it was already merged when you removed your +2 or something?
[10:19:28] no
[10:19:55] apparently it cannot resolve automatically, even if vguttierez patch has nothing to do with it
[10:22:21] ImportError: No module named 'pymysql'
[10:22:34] yep
[10:36:52] deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/521847 ok?
[10:37:03] ok
[10:38:06] sorry, too many moving parts
[10:38:44] which shows that stopping puppet and trying on a host was a good idea!
[10:39:08] I will need to leave in around 20 minutes, otherwise I won't be in time for lunch+interview prep+interview :(
[10:39:13] it is ok
[10:39:17] hopefully it works now
[10:39:32] if not, I will revert, try at another time
[10:39:39] * marostegui crossing fingers
[10:40:18] I think it ran before the proper deploy
[10:40:25] as it didn't install anything
[10:40:38] there was a run at 10:39
[10:40:51] it should install it now
[10:40:59] jynus: there's no python3-pymysql in jessie
[10:41:07] oh
[10:41:11] :(
[10:41:11] but this only applies to Stretch-based DB hosts?
[10:41:36] 'We could not connect to db1115.eqiad.wmnet to store the stats'
[10:41:55] moritzm: it should not be run on bastions
[10:42:08] but I cannot change the current monitoring system on my own
[10:43:04] telnet from prometheus2003 to db1115 on 3306 works
[10:43:07] so I guess grants issue?
[10:43:15] root@cumin1001:~$ pt-show-grants h=db1115.eqiad.wmnet,F=/root/.my.cnf | grep 10.192.0.145
[10:43:18] the grant is there
[10:43:31] ah, yes it's from profile::prometheus::ops which applies to bast3002, which is in fact still jessie until esams gets rebuilt
[10:44:16] /etc/mysql doesn't exist
[10:44:22] ha
[10:44:39] I will try to make it work for debugging purposes
[10:44:47] and then revert everything
[10:44:51] rethink it
[10:45:05] why does it fail if /etc/mysql isn't there?
[10:45:14] it may need a different approach
[10:45:16] "Access denied for user 'prometheus-mysqld-exporter'@'10.192.0.145' (using password: NO)")
[10:45:23] right
[10:45:24] got it
[10:46:21] the password is empty
[10:46:26] it is not going through
[10:46:38] so the secret wasn't picked up?
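Note on the "(using password: NO)" error above: that wording means the client sent no password at all, which fits the later observation that the password was empty. The log does not show how the script reads its credentials, so the following is only a minimal sketch of a pre-flight check, assuming (hypothetically) that they come from a simple INI-style client option file with a `[client]` section. The function names and messages are illustrative and not taken from the actual script.

```python
import configparser
from pathlib import Path


def problems_in_cnf_text(text, section="client"):
    """Return a list of problems in INI-style option-file text (empty list = looks usable)."""
    # allow_no_value: option files may contain bare flags; interpolation=None: '%' may appear in passwords.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return [f"could not parse: {exc}"]
    if not parser.has_section(section):
        return [f"no [{section}] section"]
    problems = []
    for key in ("user", "password"):
        value = parser.get(section, key, fallback=None)
        # Treat a missing key, a bare key, an empty value and an empty quoted value alike.
        if not (value or "").strip().strip("'\""):
            problems.append(f"{key} is missing or empty in [{section}]")
    return problems


def check_client_credentials(path, section="client"):
    """Same check, starting from a path; also reports a missing file."""
    cnf = Path(path)
    if not cnf.is_file():
        return [f"{cnf} does not exist"]
    return problems_in_cnf_text(cnf.read_text(), section)
```

Running a check like this before connecting would separate "file missing", "secret not rendered" and "grant missing" into distinct messages instead of a single access-denied error. It does not handle `!include` directives or other option-file features beyond plain sections and key/value pairs.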
[10:47:57] it works if I run it manually
[10:48:03] with the secret in place
[10:49:23] :-/
[10:49:33] I am going to copy the generated files for analysis
[10:49:37] then revert everything
[10:49:39] ok
[10:49:43] I am going to head for lunch I think
[10:56:06] things should be back to normal
[10:56:26] there will be leftover packages on prometheus2003, but no worries
[10:56:32] as those will be installed back soon
[11:06:48] I've made a proposal to move forward at https://gerrit.wikimedia.org/r/c/operations/puppet/+/521852
[11:07:52] also when you come back, I have stored the old files at prometheus2003:/home/jynus/targets (generated by puppet) and /home/jynus (generated by the script)
[12:20:05] and the data is consistent?
[12:23:13] I am having lunch now, but it is difficult to say
[12:35:16] I will need to spend some more time later checking
[12:35:26] because of the weird prometheus format
[12:35:40] are you leaving at 16 CEST?
[12:35:54] or earlier?
[12:35:59] marostegui^
[12:36:23] I finish the interview at 16:00 CEST and I will probably leave right after it, yeah
[12:36:42] ok, then anything pending?
[12:36:48] nope
[12:37:04] What I told you earlier are the only things
[12:37:16] compression is running from cumin1001?
[12:37:26] yep
[12:37:39] it has been running for a couple of days already with no impact
[12:37:42] so I think it is safe to leave it running
[16:45:02] DBA, WMDE-Analytics-Engineering, Wikidata, Wikidata.org, Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (ArielGlenn) Note that I don't need daily reports, weekly or even monthly would be good enough. But if there is an easy way to just...
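Note on the 12:20 question of whether the puppet-generated and script-generated target files are consistent: a plain text diff can be noisy when groups and labels are ordered differently, so a structural comparison may be easier. The sketch below assumes the files follow the usual Prometheus file-based service discovery layout (a list of groups, each with `targets` and optional `labels`) and that they have already been parsed into Python lists, for example with a YAML or JSON loader. It also assumes label values are plain strings. The function names and the host names in the usage note are hypothetical.

```python
def flatten_targets(groups):
    """Expand file_sd-style groups into a set of (target, sorted-labels) pairs."""
    flat = set()
    for group in groups or []:
        # Sort labels so that key order in the source file does not matter.
        labels = tuple(sorted((group.get("labels") or {}).items()))
        for target in group.get("targets") or []:
            flat.add((target, labels))
    return flat


def diff_targets(old_groups, new_groups):
    """Return (only_in_old, only_in_new) as sorted lists of (target, labels) pairs."""
    old = flatten_targets(old_groups)
    new = flatten_targets(new_groups)
    return sorted(old - new), sorted(new - old)
```

For instance, if the old file lists `db-a.example:9104` with `role: master` and the new one lists it with `role: slave`, the same target appears once in each returned list with its respective labels, while targets that merely moved between groups or had their labels reordered do not show up at all.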
[22:31:28] DBA, Reading-Infrastructure-Team-Backlog, Release-Engineering-Team-TODO: Drop DB tables for now-deleted zerowiki from production - https://phabricator.wikimedia.org/T227717 (Jdforrester-WMF)
[22:35:02] DBA, Reading-Infrastructure-Team-Backlog, Release-Engineering-Team-TODO: Drop DB tables for now-deleted zerowiki from production - https://phabricator.wikimedia.org/T227717 (Jdforrester-WMF) a:Jdforrester-WMF→None