[00:16:55] DBA, MediaWiki-Parser, MediaWiki-Platform-Team, Patch-For-Review: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3384856 (tstarling) >>! In T167784#3382056, @jcrespo wrote: > I do not think this is resolved, but ongoing (based on disk space trends). The disk us... [05:23:58] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385048 (Marostegui) [05:48:01] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385061 (Marostegui) I have rebooted db1034 and it came back fine, so did MySQL. RAID is fine and HW logs only show some disk errors from a mon... [07:25:04] DBA, MediaWiki-Parser, MediaWiki-Platform-Team, Patch-For-Review: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3385144 (jcrespo) a:tstarling>None Note the current active parsercache on eqiad is db1096 (none of the pc* hosts are pooled there) which star... [07:30:01] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385167 (jcrespo) Enclosure Device ID: 32 Slot Number: 5 Drive's position: DiskGroup: 0, Span: 2, Arm: 1 Enclosure position: N/A Device Id: 5 W... [07:32:34] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385169 (Marostegui) However there is another disk having errors, and not that one: ``` root@db1034:~# megacli -PDList -aALL | grep Error Media...
[07:35:35] 10DBA, 10Patch-For-Review: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354#3385206 (10Marostegui) I have restored the original replication filters for x1 after reimporting the missing tables from x1 into dewiki and wikidatawiki [07:52:43] I am deploying the puppet refactoring now [07:53:41] jynus: are the binlog_format: MIXED -> ROW changes expected in this refactor? [07:55:35] yes and no [07:55:54] they are ok, because there are no reason for them to be in any way [07:56:07] only the STATEMENT are important [07:56:27] I have not made up my mind of what is the right state now (and if they change is not a problem) [07:56:40] but they also do not aply on puppet (needs reboot) [07:56:51] yes, that I know ;) [07:57:00] so I prefered to deploy and if we change our mind [07:57:06] do it easier on hiera [07:57:26] it is a complicated topic [07:57:43] because I want to push for row, and codfw apparently was already using row mostly [07:58:04] but we may need additional hosts on STATEMENT, not mixed [07:58:16] for failover reasons until current labsdb are gone [07:58:30] ok, right [07:58:32] so the answer is "it is ok" [07:58:37] we can change it later [07:58:42] this is not supposed to be set on stone [07:58:46] it is only a refactoring [07:58:59] I expect to fine-tune later [07:59:25] but we need to embrace hiera first for fine-tuning [07:59:37] yeah, I don't like the addition of a hiera file per host, but I didn't follow the discussion so might perfectly because of puppet, and don't want to bother you while deploying it ;) [07:59:39] I will be monitoring the changes for the masters, which are the most delicated [07:59:57] well, the hiera per host is supposed to be temporary [08:00:05] but more in line with the coding style [08:00:14] shards will go away with etcd [08:00:26] and sockets are temporarily on /tmp [08:00:39] but they should slowly flow towards the default [08:00:51] ok [08:00:59] if you do 
not agree with that change, I was told we should migrate by my manager [08:07:55] I'm not saying it shouldn't have been refactored, just that the majority of the hiera files have the same content, so maybe an approach with one profile with parameters (basically the old class) and one role per shard might have been cleaner, requiring hiera files only for the exceptions. [08:09:59] no [08:10:07] we need to be able to delete it one by one [08:10:17] all the current hosts are exceptions [08:10:43] in the end, no hiera files should be there [08:11:36] plus where do you get the parameters from? [08:12:20] are you going to create role::mariadb::s3::master::old_tmp and all the combinations? [08:13:01] no, just role::mariadb::s3, and keep the hiera files for the ones that don't have only the socket and shard defined in hiera [08:13:24] right now that is all hosts [08:13:31] all hosts are on /tmp [08:13:35] and that should not be the default [08:13:44] this is a migration we have here [08:14:26] also, what is the difference in functionality between role::mariadb::s3 and role::mariadb::s4? [08:14:33] there is none [08:15:06] all mysql servers (core, we understand) should be exactly the same [08:15:35] with no puppet change on change of shard, or role [08:20:39] <_joe_> what is the info about shards used for in puppet? [08:20:51] put a banner [08:20:57] on login [08:21:00] I don't really agree given that we've made custom tuning of specific shards parameters IIRC because of their specific data/usage. But that's off topic being a more general discussion on the my.cnf, puppet and how to keep them in sync [08:21:08] I think I put it on the mysql prompt, too [08:23:40] shards on puppet are currently a hack [08:23:43] <_joe_> ok so it is changing things indeed. 
Also allows you to select hosts in cumin [08:24:06] and it should be moved to etcd [08:24:12] not because it is dynamic [08:24:33] because it should not contradict pooling information there [08:25:23] so eventually, puppet should not know about shard or role [08:25:48] <_joe_> anyways, my original advice was to move role::mariadb::core => profile::mariadb::core or something similar, create one role per shard, and define the shard property of the profile via hiera, but just for the role. Other things like master = true or similar can go in per-host hiera then. But tbh, I think this is better discussed on a ticket where the long-term vision for puppet+mariadb at the [08:25:54] <_joe_> wmf is spelled out [08:25:54] <_joe_> do we have such a ticket? [08:26:09] <_joe_> if not, I'd ask you people to create one :) [08:26:26] this is not this patch [08:28:20] old puppet style was not good and new is not good either [08:28:31] this is not a fix for vision or style [08:28:38] this is a fix for stretch [08:28:53] I think it will make later fixes easier [08:28:59] but we are not yet there [08:29:36] basically, this is a fix for stretch, systemd and mariadb 10.1 [08:31:38] volans: in your method, are you going to have roles for core_s1_and_s2 ? [08:31:56] and each 2 and 3 and 8 combination? [08:32:05] why? [08:32:14] because we are going to deploy that [08:32:56] profiles, you can have profile(s1) && profile(s2) [08:33:11] but please tell me how you are going to apply that without parameters [08:34:02] are you referring to the dbstores/analytics and such? 
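The role/profile split suggested at 08:25:48 could look roughly like the sketch below. This is only an illustration of that proposal, not the code that was deployed: the class bodies are placeholders, the parameter names `shard` and `master` are assumptions, and it presumes a hiera hierarchy with a per-role level, which is site-specific.

```puppet
# Hypothetical sketch. In a real module each class would live in its own file.
class profile::mariadb::core (
  $shard,           # e.g. 's3'; no default, so it has to come from hiera
  $master = false,  # could be overridden in per-host hiera
) {
  # Placeholder for the real resources; here the values are only reported.
  notify { "mariadb core host: shard=${shard} master=${master}": }
}

# One thin role per shard, with no logic of its own.
class role::mariadb::s3 {
  include ::profile::mariadb::core
}

# Hiera data scoped to the role (assumed key, resolved through automatic
# class parameter lookup):
#   profile::mariadb::core::shard: 's3'
```

With this layout, hosts that only differ by shard would need no per-host hiera file, while the objection raised in the chat still applies: hosts carrying several shards would need either combined roles or a parameterised profile.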
[08:34:06] no [08:34:13] production core hosts [08:34:15] mediawiki [08:34:33] we are going to have every combination for HA purposes [08:35:00] every database is almost unique in content, and puppet should not control that- puppet should only know it is a core database, or a misc database, or a parsercache, or an external store [08:35:06] at most [08:37:40] I was referring to the current shards divisions, if you're going to change all that I'm not aware, so cannot consider it [08:37:59] well, you are not part of the DBA discussions [08:38:12] I invited you [08:38:45] if you want to know our latest agreements and pain points, etc [08:38:55] it is not set into stone [08:39:03] but there are tickets about it [08:39:26] https://phabricator.wikimedia.org/T159423 [08:40:09] https://gerrit.wikimedia.org/r/#/c/338996/1/wmf-config/db-eqiad.php [08:40:36] check the db1096:3312-like hosts [08:40:59] is the current puppet code perfect in every way? [08:41:02] of course not [08:41:07] this is just an iteration [08:41:39] but I need this to unblock basic stuff, and we can continue refactoring later [08:42:18] I cannot wait for eternal discussions [08:42:33] in the more geneal sense, if you want to support a more dynamic setup then you probably want the my.cnf to be generated partially from puppet with some very generic global configuration and partially from something else (not etcd IMHO) that will handle this complexity, and will take care of keeping the on-file and live configurations in sync allowing you to decide how and when to rollout those [08:42:39] changes across the hosts [08:42:50] ok ok ok [08:42:54] that is out of scope here [08:42:58] as joe said [08:43:31] here and now I need stretch suppor and socket and basedir configuration [08:43:43] this will give it to me now [08:43:54] fair enough [08:43:55] and I am very open to any other patch later [08:43:59] patch [08:44:05] enphasis there :-) [08:44:48] :) [08:45:16] do you think I am 100% happy with this patch? 
[08:46:07] I'm never happy with puppet patches ;) they always force me to do something I wouldn't have done :-P [08:46:20] if you want to help, please help me solve this concrete problem: https://gerrit.wikimedia.org/r/#/c/361824/ [08:46:44] I want to keep the basedir option, because it is the right one for the templates [08:46:57] but I cannot write to the parameter [08:47:10] I do not want to rename the parameter either [08:48:05] maybe changing the default from '' to undef might do the trick, not sure but it is something I would try [08:48:59] because variables are immutable in puppet, but if it's undef it should allow you to define it [08:49:17] ok, I can try that [08:49:59] how do I check for undef? [08:50:24] I would assume == undef would not work? [08:52:50] no you can do that :D [08:53:00] we use it in many places [08:53:20] what about if $basedir { [08:55:51] if you want to cover the case where someone passes an empty string, I'm afraid it might not work if puppet truthiness is like Ruby's, where '' is true [08:58:31] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385410 (Marostegui) Interesting that the errors have not increased for the disk reported on: T166208#3385167 maybe those were generated at: 20... [08:59:00] actually, I want to skip the empty string [08:59:11] because it would mean the variable is set up [08:59:26] I want to initialize the variable if it is not initialized manually [09:00:08] why don't you want to use if $foo == undef ? [09:00:27] I asked if that was legal [09:00:40] yes, we use it all over the place [09:01:00] sorry, I said "no you can do that :D" that was confusing [09:01:36] was actually affirmative, yes it works [09:01:38] my bad [09:10:33] what is the error?
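For the undef check discussed above, a minimal sketch follows. It assumes a hypothetical class name and a hypothetical `package` parameter; only the `basedir` parameter, the `/opt/wmf-mariadb10` path and the `/etc/init.d/mysql` link come from the discussion. Because a Puppet variable can be assigned only once per scope, and that includes class parameters, the sketch derives a second local variable instead of writing back to `$basedir`. The explicit comparison against `undef` and `''` avoids relying on string truthiness, which differs between Puppet language versions ('' is false in Puppet 3 and true from Puppet 4 onward).

```puppet
# Hypothetical class and parameter names, for illustration only.
class example_mariadb::service (
  $package = 'wmf-mariadb10',
  $basedir = undef,
) {
  # A class parameter cannot be assigned again inside the class body, so the
  # effective value goes into a separate local variable.
  if $basedir == undef or $basedir == '' {
    $real_basedir = "/opt/${package}"
  } else {
    $real_basedir = $basedir
  }

  # Link the init script to the service file shipped under the basedir.
  file { '/etc/init.d/mysql':
    ensure => 'link',
    target => "${real_basedir}/service",
  }
}
```

Templates that need the path would then read the derived variable rather than the raw parameter, so an unset parameter never renders as an empty basedir.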
[09:10:56] o [09:10:58] order [09:11:13] optional parameter listed before required parameter (parameter_order) [09:11:17] yeah :( [09:11:29] one test is unclear [09:11:33] if the parsing fails [09:11:53] * volans wondering if undef makes the parameter required though [09:14:09] I am noting that meanwhile, all database hosts have puppet disabled [09:20:11] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385536 (10jcrespo) My report came after the RAID check timed out, which made me worry- check history of alerts on icinga. [09:22:39] "Error 400 on SERVER: Cannot reassign variable basedir at /etc/puppet/modules/mariadb/manifests/service.pp:25 on node db2016.codfw.wmnet" [09:22:41] ideas? [09:25:17] jynus: seems that I'm wrong and you cannot re-assign either if it's undef, I'm looking at the existing code and so far in alla cases another variable is used :( [09:25:35] I though you would have tried on a puppet compiler first :) [09:26:00] s/would have tried/would try/ [09:26:53] I say again that I have 150 hosts with pupper disabled and potentially broken [09:31:12] although there is one case but is not in a class, and I guess that tricked me [09:31:25] in realm.pp $realm [09:33:23] your solution doesn't work either [09:33:48] grep basedir /opt/wmf-mariadb10/service [09:33:48] which one? [09:33:51] basedir="" [09:34:07] Notice: /Stage[main]/Mariadb::Service/File[/etc/init.d/mysql]/target: target changed '/opt/wmf-mariadb10/service' to '/service' [09:34:27] who knows, we may be even causing an outage on labsdns [09:34:36] file { "${basedir}/service": [09:34:44] instead of $initd_basedir [09:38:44] 10Blocked-on-schema-change, 10DBA: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3385590 (10Marostegui) I have been doing tests with MySQL 5.6 and it works perfectly fine there. 
I have updated the bug report with this info - still not reply from MariaDB. [09:38:52] https://gerrit.wikimedia.org/r/361838 [09:39:18] 10Blocked-on-schema-change, 10DBA: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3385591 (10jcrespo) They replied- they said it was due to the binary collation. [09:40:28] Ah, looks like I was using the same tab I had when I reported it and never refreshed [09:40:31] lame [09:40:54] it is still a bug [09:40:58] however it does work with binary on mysql [09:41:08] oh, I wasn't challenging that [09:41:15] just the "no response yet" [09:41:17] if it is "expected" they should specify it on the doc [09:41:19] yeah yeah [09:41:20] :) [09:41:31] I tested mysqlf too on 5.7 [09:41:40] *myself on mysql [09:42:47] awful :( [09:43:26] jynus: re:gerrit/361838 LGTM, sorry I'm late [09:43:36] Notice: /Stage[main]/Mariadb::Service/File[/etc/init.d/mysql]/target: target changed '/service' to '/opt/wmf-mariadb10/service' [09:44:01] I was checking ig the $basedir in the various erb files where coming from that one or not too [09:44:12] no, it comes from cofig [09:44:24] but it was pending, too [09:44:35] there is a lot of things pending [09:44:45] for example, other hosts hardcode the socket still [09:45:00] I said to deploy this and continue working on a separate commit [09:45:44] now, services that run automatically (enabled=true) [09:45:51] maybe completely broken [09:46:07] not our fault, though, we warned about not to use that functionalty [09:46:16] and that we were not going to support it [09:47:31] now /etc/init.d/mysql status fails on labsdb1010 [09:48:08] I think it defaults to mariadb10 config, maybe? [09:48:36] labsdb1001 works [09:48:49] those are different versions, no? [09:49:05] yes, I suspect it doesn't get the new basedir [09:49:42] we have to find the manifests [09:49:52] because they are not under the mariadb role [09:50:02] I wonder if we should leave it broken? 
[09:51:30] maybe the socket needs changing, too [09:54:40] so this will solve something: https://gerrit.wikimedia.org/r/#/c/361838/ [09:54:51] I am not sure it will solve the problem [09:55:14] basedir in init.d is still /usr/local [09:57:28] are we sure the service code works? [10:04:20] I do not think it works, if it worked, basedir would be set to the right place according to the package [10:06:33] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3385704 (10Marostegui) Unfortunately we do not have any history from yesterday, but the host was indeed having issues as can be seen on the log u... [10:07:33] Marstegui- so my fears of maybe not pool that back were more or less funded? [10:07:43] totally [10:07:49] I want to see if it finishes the alters [10:07:54] ok, do not spend too much time on that [10:07:59] There big one is yet to come (altering a 140G table) [10:08:08] just wanted to notify and move on [10:08:12] (we) [10:08:31] nah, I will leave it running as it still probably needs more than 20h to finish all the pending alters [10:08:34] we will see how it behaves [10:11:18] move db1035 issues to a separate ticket if you are going to do more things [10:11:26] enphasis on IF [10:11:33] 10DBA: dbstore1001 mysql crashed - https://phabricator.wikimedia.org/T169050#3385724 (10Marostegui) [10:11:55] so we keep the alter one ontopic [10:12:16] jynus: will do :) [10:12:22] (don't think I will) [10:12:23] dbstore1001 was doing the backups right now [10:12:54] So is it "normal"? Never saw it crashing when doing backups, but maybe you have :) [10:13:11] no [10:13:14] it is not normal at all [10:13:18] ok :) [10:14:20] "InnoDB: We intentionally crash the server, because it appears to be hung." 
[10:14:41] yep [10:15:06] I just pointed out as a potential direct or indirect cause [10:15:13] it could be the service stuff [10:15:21] but nor on normal, production hosts [10:15:33] it seems none of those [10:16:03] Nah, it started complaining: InnoDB: Warning: a long semaphore wait [10:18:05] https://bugs.mysql.com/bug.php?id=44841 [10:18:15] [8 Sep 2009 20:32] James Day [10:18:16] Nickolay, it's just a symptom of heavy load. It's one of the ways that extreme heavy disk-bound load shows up. In this case the main symptom is the data dictionary operations that involve waiting for disk I/O.dict/dict0dict.c and row/row0mysql.c are those. InnoDB doesn't expect to wait hundreds of seconds for disk I/O so when it does the watchdog process notices and starts printing diagnostic output. Sometimes the wait gets so l [10:18:33] interestingly this is the same thing db1034 had before crashing too [10:20:00] and we would have no problem with it [10:20:10] except that without gtid, it breaks the server [10:21:08] can you give a look at https://gerrit.wikimedia.org/r/#/c/361841/5/modules/mariadb/manifests/service.pp [10:21:18] it doesn't work in either way [10:22:58] 10DBA: dbstore1001 mysql crashed - https://phabricator.wikimedia.org/T169050#3385767 (10Marostegui) There are a bunch of: ``` InnoDB: Warning: a long semaphore wait: --Thread 139834895070976 has waited at dict0stats.cc line 2406 for 241.00 seconds the semaphore: X-lock (wait_ex) on RW-latch at 0x144b040 '&dict_o... [10:23:14] let me check (although not sure if I am going to be able to help you much!) [10:23:26] undef and '' I think they area always false [10:23:30] unless I am missing something [10:25:26] it looks pretty straightforward to me that if :| [10:27:02] I don't get it [10:27:06] So what does it set it to? [10:27:22] ˜/jynus 11:55> basedir in init.d is still /usr/local -> that? 
[10:27:32] yes, it is empty [10:27:43] that is the deafult hardcoded on the init.d [10:29:52] or something is failing on execution [10:31:16] oh, I know what is happening [10:31:23] what is it? [10:31:54] there are 2 templates [10:31:59] mariadb.service [10:32:02] and mariadb.server [10:34:05] mariadb.service should be deelted (it is now on the package) [10:34:11] server was not changed [10:34:18] and it doesn't really need to change [10:34:30] beacause it can set it from my.cnf [10:36:18] so basedir was being set from another place? [10:37:04] it was not set [10:37:18] then I don't get it [10:37:25] the last patch should clarify it [10:37:29] let me see [10:37:38] it is easy to show than to explain [10:37:54] mariadb.service.erb goes away [10:38:12] mariadb.server.erb gets the right parameter [10:38:36] (although it is not needed, because it gets it from /etc/my.conf) [10:38:47] Aaaah I see [10:38:51] Now I see it [10:38:56] the other change is unneded [10:39:01] but I wanted to do it anyway [10:39:11] I checked the systemd unit is not used anywhere [10:39:20] It comes from the package on stretch [10:39:34] that should fix this particular issue finally [10:40:01] it is ok, because it forced to do more work than I initially intended to do [10:40:10] but it was going to be done anyway [10:40:39] good catch, i get it now :) [10:41:08] 10DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3385799 (10Marostegui) [10:41:15] the problem comes that the init.d wants to be smary [10:41:23] and if it doesn't get a basedir, it doesn't fail [10:41:32] it tries to get it from config or somewhere else [10:41:37] so it was not clear aat first [10:42:14] +basedir="/opt/wmf-mariadb10" [10:42:31] \o/ [10:43:14] there is some weirdness on labsdb1010 [10:43:20] but I think that was there before [10:43:31] the important thing is that /etc/init.d/mysql status works [10:44:11] I have tested now 10 on jessie, 10 on 
trusty, 101 on jessie [10:44:20] I have left 101 on stretch [10:46:11] I will now restart db2072 and 62 [10:46:16] to test config from 0 [10:50:11] 10DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3385815 (10Marostegui) Current status of replication: s6 is broken with duplicate key on jawiki.watchlist: `Error 'Duplicate entry '18049674' for key 'PRIMARY'' on query. Default database: '... [10:50:18] how can a replace into end up with duplicate entry? ^ [10:51:06] it is a transaction issue [10:51:17] (probably) [10:51:40] you were right :) [10:51:41] full transactions are applied- if fails on something like INSERT + REPLACE or something like that [10:51:51] stop slave start slave worked [11:04:28] 10DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3385836 (10Marostegui) s6 was fixed by stopping/starting slave. [11:36:53] so I have checked and puppet does what it is supposed to do on both masters and slaves, stretch and jessie, including pt-heartbeat [11:37:03] I will do a faster enable later [12:31:53] 10Blocked-on-schema-change, 10DBA: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3386209 (10Marostegui) [12:49:02] marostegui: I do not see errors on puppet, do you see anythign strange? 
[12:49:27] nope, nothing [12:49:38] let me check dbstore [12:49:40] maybe I should do the master and others manually [12:49:59] I can do codfw masters if you like [12:50:01] I am working with them [12:50:04] and you can do eqiad [12:50:18] I already enabled all of codfw [12:50:25] Ah :) [12:50:37] I am hesitant for db1* ones [12:50:43] in case something bad happens [12:50:58] I changed the pt-heartbeat execution [12:51:06] let's leave them disabled till tomorrow morning [12:51:11] but I tested and the new one works [12:51:15] nah [12:51:22] if it works on codfw it should work there [12:51:30] as pt-heartbeat also runs there independently [12:51:33] except on a new setup [12:51:49] the good thing is that stretch is finally fixed [12:52:07] and there is one thing I have to stress to you on mysql restart [12:52:22] the idea now is that if you have to stop or restart mysql or the host [12:52:38] remove the socket: line from hiera first [12:52:38] I tried a noop run on db2018 (s3 codfw master - where puppet is still disabled) and the changes it would have made look good [12:52:41] run puppet [12:53:03] is it clear what I mean? [12:53:10] yeo [12:53:11] yep [12:53:13] it is difficult to explain [12:53:21] basically, to migrate the socket location [12:53:26] https://gerrit.wikimedia.org/r/#/c/361456/15/hieradata/hosts/db1037.yaml [12:53:29] that, right?
[12:53:58] stop, then remove from hiera, then run puppet, then restart, then [12:54:07] no [12:54:15] the shard must stay for now [12:54:23] No, I meant the socket line [12:54:34] oh, sorry [12:54:40] I read it backwards [12:55:10] keep the mariadb::shard: 's6', remove the other [12:55:21] and it will default to /run [12:55:30] awesome :) [12:55:31] will do [12:55:32] puppet will create the dir for you [12:55:41] it is not a high priority [12:56:01] but if you happen to restart a host, it is one step towards [12:56:09] Sure, I will do :) [12:56:17] T148507 [12:56:32] the shards will go away, too, at some point [12:56:56] so on grafana should only be strange things, like the host that require row [12:57:03] or temporary states [12:57:04] you want me to enable puppet on db2018 (s3 codfw master)? [12:57:10] s/grafana/hiera [12:57:18] is it not on? [12:57:25] The last Puppet run was at Wed Jun 28 07:24:54 UTC 2017 (316 minutes ago). Puppet is disabled. deployment of gerrit:361456 [12:57:30] mmm [12:57:33] let me see [12:57:45] are you sure? 
[12:57:55] the warning disappeared from icinga [12:57:57] well, that is the motd I saw 5 minutes ago :) [12:58:09] oh, don'tr trust motd [12:58:11] looks like it got enabled now [12:58:14] :-) [12:58:17] haha [12:58:44] only db1* hosts pending [12:58:56] I will run it manually on a few key hosts [12:59:00] ok :) [12:59:22] but check if something like the row_format is ok everywhere where you restart for some days [12:59:32] if you feel there is an error in some [12:59:47] the same thing applies: add it to hiera [12:59:54] ok :) [12:59:58] right now I set up ROW as default [13:00:10] When i did the code review I checked sanitarium2 masters and they looked ok [13:00:10] but that doesn't necesarily is the right thing [13:00:16] those worried me the most [13:00:28] yes, those are not controlled by hiera yet [13:00:33] I focused on core [13:00:52] then, if you see something strange just either tell [13:01:00] or change it yourself [13:01:07] will do! [13:01:12] thank you [13:01:14] thanks for all the refactoring, good step! [13:01:16] this was a large change [13:01:26] and now we can fine-tune [13:01:31] and change other stuff [13:04:38] indeed, thanks for working on it another thing we will be happy about in the future [13:08:03] for example, for the shards- we may want to setup a more global config [13:08:37] with a hierarchy, such as the one for prometheus [13:08:56] for me that is a minor issue [13:09:06] it is mostly formatting [13:09:47] there seems to be puppet errors [13:12:46] yes i am checking too [13:15:27] I think I may have missed some host to declare [13:16:02] which furthers my position that shards are not that important [13:16:04] They one I am seeing is declared (db2017) but yet fails [13:16:24] ah no [13:16:25] I don't see db2035 [13:16:26] it is noit [13:16:31] no no, db2017 isn't declared [13:16:35] must be that [13:16:36] how can I have missed those? [13:16:51] did I skip a shard? 
[13:17:03] botyh are s2 [13:17:05] let me check the others [13:17:12] I will check all of s2 [13:17:24] they are missing yep [13:17:44] yes, all s2 in codfw is missing [13:18:03] eqiad looks declared [13:22:07] https://gerrit.wikimedia.org/r/361857 [13:22:32] the master looks good which is what I care for :p [13:23:06] so, again, a wrong flag there shouldn't affect the host itself [13:23:18] except maybe the master, but that should not be in puppet anyway [13:23:56] yes yes [13:28:11] do you see why I reserved all of today to do this, right? [13:28:14] :-D [13:28:24] haha [13:28:27] Yeah [13:28:44] I will double check eqiad [13:28:53] Actually when you said it yesterday like: i will deploy tomorrow morning I was like: uh, this might be fun [13:28:54] and when the deployment finishes [13:28:57] I'm totally unsure if this is a legit gripe https://phabricator.wikimedia.org/T169038 [13:29:15] is that "as expected" or "oops" on our part? [13:29:16] sorrym gripe? [13:29:37] sorry, gripe is like complaint or grumbling about something relatively trivial [13:30:30] I have not yet fully understand the case, but it is legitimate, but not high priority [13:30:37] *looks [13:30:42] ok thanks [13:30:56] as in, sometimes I have to run ANALYZE and then it works [13:31:07] somtimes, there is not much to do [13:31:14] jynus: db2017 now works fine :) [13:31:47] Before the query has been done, I get a database close but I cannot filter the query more. 
[13:32:00] that normally means the query is detected as slow [13:32:25] so that part is expected [13:33:39] ok thanks jynus, I won't close that task then but I'll drop from high to normal [13:35:24] my personal policy is- if you give us details about the issue and see something wrong that is actionable, can fix it, but we cannot support ourselves users's query optimization [13:35:37] as we can barely supporty production issues [13:36:40] but anyone can check it and we will be willing to apply any proposed fix- sometimes it is not easy though- labs has limitations due to the extra filtering [13:37:10] jynus: totally understood [13:39:34] 10DBA, 10Labs, 10Labs-Infrastructure, 10cloud-services-team: SQL dewiki_p categorylinks cl_from index missing - https://phabricator.wikimedia.org/T169038#3386490 (10jcrespo) Without looking too much in depth, there seems to be some unexpected query plan- there is not much we can do about it- except maybe r... [14:06:49] 10Blocked-on-schema-change, 10DBA: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3386778 (10Marostegui) [14:08:01] 10DBA, 10Labs, 10Labs-Infrastructure, 10cloud-services-team: SQL dewiki_p categorylinks cl_from index missing - https://phabricator.wikimedia.org/T169038#3386781 (10Bawolff) >>! In T169038#3386490, @jcrespo wrote: > Without looking too much in depth, there seems to be some unexpected query plan- there is n... 
[15:03:23] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1, s2, s4, s5 and s7 (eqiad) - https://phabricator.wikimedia.org/T164185#3387075 (jcrespo) [15:03:26] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387071 (jcrespo) Resolved>Open labsdb1003 seems to be missing a primary key- T78730 [15:06:10] DBA, Labs, Labs-Infrastructure, cloud-services-team: SQL dewiki_p categorylinks cl_from index missing - https://phabricator.wikimedia.org/T169038#3387086 (jcrespo) p:Normal>High Ok, sorry, I understand now (the initial email with the complex join confused me)- I can see now that the host... [15:15:40] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387152 (jcrespo) a:Marostegui>jcrespo labsdb1003 seems to be missing a primary key- T166207 [15:16:48] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387163 (jcrespo) labsdb1003 seems to be missing a primary key- T169038 [15:17:39] DBA, Labs, Labs-Infrastructure, cloud-services-team: SQL dewiki_p categorylinks cl_from index missing - https://phabricator.wikimedia.org/T169038#3385371 (jcrespo) Probably related to T166207 (alters sometimes time out due to excessive load on labsdb* hosts).
[15:19:05] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387178 (Marostegui) Looks like it failed for dewiki (but not for wikidatawiki) [15:19:46] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387180 (jcrespo) I am running the alter now. [15:20:44] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387182 (Marostegui) labsdb1001 is fine, so it might have been a one-off error, with so many alters to be done I probably missed it. Sorry... [15:31:49] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3387247 (Marostegui) s6 has complained again with the same issue, a duplicate entry on a REPLACE INTO for the same table: watchlist [15:32:59] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3387255 (jcrespo) yeah, after a crash, on a non-gtid slave, we can consider the host as broken. We should just migrate backups to dbstore2001. [15:34:34] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3387256 (Marostegui) >>! In T169050#3387255, @jcrespo wrote: > yeah, after a crash, on a non-gtid slave, we can consider the host as broken. We should just migrate backups to dbstore2001.... [15:36:29] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3387258 (jcrespo) I do not really have a strong opinion on doing or not doing that- I am only clear that we should just build those multi-instance shards and forget about the current hosts.
[15:44:30] DBA, Operations, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3387288 (jcrespo) With the new puppet refactoring, hosts just work- that doesn't mean the puppet structure is ideal- we need to change many things such as multi-instance support and fix the... [15:48:35] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3387309 (Marostegui) Totally agree that we have to move on. Exporting and importing the table didn't fix it, so it is really broken. I will set the filters, let it catch up and reimport i... [16:23:03] DBA, Labs, Labs-Infrastructure, cloud-services-team: SQL dewiki_p categorylinks cl_from index missing - https://phabricator.wikimedia.org/T169038#3387468 (jcrespo) Open>Resolved a:jcrespo This took less time than expected, but should now be fixed: ``` MariaDB [dewiki]> EXPLAIN SELECT... [16:24:44] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387488 (jcrespo) Done, is this the right final state? ``` MariaDB [dewiki_p]> SHOW CREATE TABLE dewiki.categorylinks\G *********************... [16:44:57] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387585 (Marostegui) Yep! That is it! Thanks!
[16:45:17] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1, s2, s4, s5 and s7 (eqiad) - https://phabricator.wikimedia.org/T164185#3387587 (jcrespo) [16:45:20] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3387586 (jcrespo) Open>Resolved [18:25:07] DBA, ArchCom-RfC, RESTBase-API, Reading List Service, and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3387923 (Fjalapeno)