[06:44:44] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647450 (10Marostegui) The sysctl settings error looks gone now, and I can read them actually: ``` root@db1082:/proc/sys/net# sysctl -a | wc -l 1702 ``` The offset error looks weird: ``` root@db1082:/proc/sys/net# nt... [07:20:03] 10DBA, 06Operations: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2647480 (10Marostegui) I have renamed the table in the following codfw hosts: ``` db2034.codfw.wmnet db2042.codfw.wmnet db2048.codfw.wmnet db2055.codfw.wmnet db2062.codfw.wmnet db2069.codfw... [08:23:22] 10DBA, 06Operations: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2647521 (10Marostegui) I have renamed the table in eqiad hosts (the already exists errors are because those hosts were used as canary: S1 - enwiki: ``` root@neodymium:/home/mar... [08:33:41] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647554 (10Marostegui) NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine [08:45:40] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647587 (10Marostegui) a:03Marostegui [08:51:59] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647640 (10Marostegui) As we discussed, we will also test InnoDB compression (T139055) [09:30:18] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647735 (10Marostegui) db1070 - 1.1TB ibdata file out of 1.3T used db1082 - 1.1TB ibdata file out of 1.6T used (see T145533 as this server is possibly right now in a weird state) db1087 - 1.1TB ibdata file out of 1.3T... [09:36:46] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647755 (10jcrespo) >>! 
In T145533#2647554, @Marostegui wrote: > NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine In fact, I would reboot it several times to see if it happens again... [09:38:20] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647757 (10Marostegui) Makes sense, thanks for giving me context on past issues. I will do that a few times, and by the end of the day I will give it another final reboot and leave it like that for a few days, just... [10:00:14] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647769 (10jcrespo) By experience, I would recommend against mysqldump. I would suggest mydumper with the slave stopped. Or better, probably, stop the slave, perform with some parallelism ENGINE=INNODB, FORCE in a serv... [10:24:06] powercycling db1061 [10:24:55] <_joe_> I have a dumb SQL question [10:25:03] <_joe_> say I want to do [10:25:21] postgres? [10:25:26] <_joe_> no, mysql [10:25:26] go on [10:25:27] <_joe_> :) [10:25:28] oh [10:25:38] <_joe_> select * from mytable order by field1 ASC [10:25:45] <_joe_> I get in the results [10:25:52] <_joe_> thumbor.svc.eqiad.wmnet [10:25:59] <_joe_> before thumbor1001.eqiad.wmnet [10:26:05] ok [10:26:11] you want different ordering [10:26:11] <_joe_> so '.' takes precedence over '1' or any letter [10:26:17] <_joe_> same as '_' [10:26:32] by default you get the ordering from the config or the one intrinsic to the table [10:26:37] <_joe_> well I'd like python's sort() and mysql ORDER BY to use the same ordering [10:26:40] you should check that [10:26:44] <_joe_> it's due to the collation, right?
[10:26:53] but you can change it, if you want [10:27:02] <_joe_> yes, let me check then [10:27:07] with, I think, the COLLATE keyword [10:27:32] so you may want utf8mb4 (UTF-8), binary, ASCII [10:27:55] <_joe_> yeah I'll run some tests [10:28:18] if you tell me where that is, I can help you faster [10:28:32] and it is not a dumb question at all [10:29:04] I recently asked for mediawiki's dbslist to be ordered using unicode collation (unix sort) instead of binary [10:29:40] as a tip, you can execute sort by default (utf8) or with LANG=C to see the differences [10:30:35] <_joe_> latin1_swedish_ci [10:30:36] <_joe_> sigh [10:30:40] this is the relevant bit if done at query time: http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html [10:30:53] so, I would suggest fixing the config/table [10:30:58] <_joe_> why is that db using that collation [10:30:59] instead of doing the above [10:31:03] <_joe_> :P [10:31:09] <_joe_> yeah I was about to suggest the same [10:31:18] well, if you tell me which db it is, I can tell you [10:31:40] <_joe_> puppet [10:31:49] <_joe_> what else [10:31:50] let me see [10:32:02] <_joe_> I was looking at the resources and param_names tables [10:32:44] or you can write your own collation... http://dev.mysql.com/doc/refman/5.6/en/adding-collation-simple-8bit.html [10:32:59] <_joe_> volans: shush!
this is serious business :P [10:33:05] :-P [10:33:20] <_joe_> I'm pretty sure that anything other than latin1_swedish_ci would be ok [10:33:24] <_joe_> even ASCII [10:34:11] _joe_, the instance is well configured [10:34:15] <_joe_> jynus: I'll try the queries with different collations and see which one is ok [10:34:21] if you see SHOW VARIABLES like 'char%'; [10:34:30] <_joe_> jynus: I am pretty sure that db has passed over from db to db for ages [10:34:32] it is the database creation and thus all tables [10:34:36] <_joe_> db host to db host [10:34:42] that were created wrongly [10:34:50] character_set_database | latin1 [10:35:06] <_joe_> yes, so let me try just to do COLLATE utf8mb4 [10:35:07] CREATE DATABASE `puppet` /*!40100 DEFAULT CHARACTER SET latin1 */ [10:35:30] so, I do not know if it makes sense to fix something that is going to be deleted [10:35:39] <_joe_> my point [10:35:43] but you can just use the link I sent you [10:35:52] to get any ordering [10:36:21] <_joe_> jynus: yes, I should've looked at the db first, then asked questions :P [10:36:21] "ORDER BY X COLLATE Y" [10:36:31] <_joe_> jynus: yes, I even know the syntax [10:36:43] <_joe_> I just dumbly assumed it was utf-8 by default everywhere [10:36:54] "SHOW COLLATION;" [10:37:00] will show you all available ones [10:37:18] it would be difficult not to find one you want [10:37:28] note that we are a bit special [10:37:36] because mediawiki mainly uses binary collation [10:37:46] we end up doing a lot of things in binary [10:38:10] but that is just the default, no reason to do it if it is not desired [10:52:08] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647837 (10jcrespo) Just one thing, rebooting would be a great way to test https://gerrit.wikimedia.org/r/#/c/310564/ In fact, I am going to test it on db1061 now, too.
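The ordering puzzle above can be reproduced outside MySQL. A minimal Python sketch (hostnames taken from the conversation; the case-folding key is only a rough analogue of what `_ci` collations do for plain ASCII letters):

```python
# '.' is byte 0x2E, which sorts before '1' (0x31) and all letters, so
# any bytewise (binary/ASCII) comparison puts "thumbor.svc..." first --
# the same result _joe_ saw from ORDER BY.
hosts = ["thumbor1001.eqiad.wmnet", "thumbor.svc.eqiad.wmnet"]
print(sorted(hosts))  # thumbor.svc.eqiad.wmnet comes first

# Case-insensitive (_ci) collations compare case-folded text instead,
# roughly what a key function gives you in Python:
print(sorted(["b", "A", "a"], key=str.lower))
```

Running `sort` with `LANG=C`, as suggested above, gives the same bytewise order as the first call.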
[10:54:32] <_joe_> select * from param_names ORDER BY name COLLATE utf8_bin; [10:54:32] <_joe_> ERROR 1253 (42000): COLLATION 'utf8_bin' is not valid for CHARACTER SET 'latin1' [10:54:35] <_joe_> uhm [10:57:10] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (10jcrespo) [10:57:22] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647853 (10jcrespo) [10:57:28] _joe_: see https://dev.mysql.com/doc/refman/5.7/en/charset-binary-collations.html [10:57:55] _joe_, if data is in latin1, only latin1 collation can be used [10:58:11] <_joe_> jynus: which don't do what I want [10:58:23] <_joe_> but let me run some tests [10:58:32] pager grep latin1 [10:58:36] SHOW COLLATION; [10:58:43] pager less [10:59:12] then you must convert the table to another collation, latin1 is limited for a reason :-) [10:59:45] <_joe_> jynus: I created a copy of param_names, which has 43 rows, to run some tests [10:59:52] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647854 (10Marostegui) Sounds good - I have rebooted it twice already and expect to do a few more before the end of the day. [10:59:59] <_joe_> I'll drop the new table as soon as I am done [11:00:03] ok [11:00:25] let me know if I can help, I know you can do it [11:00:34] it is just that we can do it faster [11:00:41] :-) [11:02:33] <_joe_> jynus: I would need to use ascii [11:02:36] <_joe_> as a charset [11:02:47] <_joe_> because of course, python 2.x treats strings as ascii [11:02:57] <_joe_> so it's python's fault, not mysql's [11:03:10] <_joe_> let me try to fix the python code first [11:03:14] oh! 
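The ERROR 1253 above happens because a collation is tied to a specific character set: `utf8_bin` cannot be applied directly to `latin1` data, so the table has to be converted first. A rough Python analogue of that conversion step (the sample bytes are illustrative, not from the puppet db):

```python
# Collations operate on decoded text, not raw bytes: to sort latin1
# data under a different rule, decode it to unicode first -- the same
# reason the latin1 table must be converted before utf8_bin applies.
raw = [b"caf\xe9", b"cafe"]                  # latin1 bytes, 0xE9 = 'é'
decoded = [b.decode("latin-1") for b in raw]
print(sorted(decoded))                       # codepoint order: cafe, café
```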
[11:03:31] yes, make sure it is using unicode strings [11:03:47] (sorry, I always think it is a mysql problem) [11:04:03] the whole "u''" [11:11:22] <_joe_> jynus: nah I just mixed up orderings there [11:12:11] the database wasn't right, either [11:26:15] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647878 (10jcrespo) a:03jcrespo [11:26:23] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (10jcrespo) p:05Triage>03Normal [11:33:11] interesting, I think db1061 did not page [11:33:17] which is good [11:33:29] but I do not know why [11:34:09] probably because hosts do not do that, and services depend on the host [11:34:39] if that is the case, that is the intended behaviour [11:58:53] jynus: it didn't page because the host check contact_groups is admins (doesn't have sms) and the services ones with sms I think are dependent on the host itself [11:58:58] [1474247594] HOST NOTIFICATION: irc;db1061;DOWN;notify-host-by-irc;PING CRITICAL - Packet loss = 100% [12:10:21] which is good [12:26:40] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647949 (10Marostegui) Every time the server gets restarted, NTP alerts until I run the ntp sync manually. I have rebooted again to see how it comes back and what happens if I do not touch it. [12:32:32] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647953 (10MoritzMuehlenhoff) How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart of NTP. [12:32:34] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647954 (10jcrespo) @Marostegui NTP being off for some minutes is "normal" (known limitation with low priority). What was an issue/strange was it being off for hours/days.
[12:33:47] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647961 (10jcrespo) @MoritzMuehlenhoff see above: >>! In T145533#2645185, @jcrespo wrote: >> Check size of conntrack table >> >> Notifications for this service have been disabled >> WARNING 2016-09-17 03:19:29 1d 13... [12:34:55] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647964 (10Marostegui) >>! In T145533#2647953, @MoritzMuehlenhoff wrote: > How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart... [13:19:52] 10DBA, 10ChangeProp, 10MediaWiki-API, 10MediaWiki-Database, and 7 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619222 (10hashar) Deployed on current wmf.18 as well as next wmf.19. [15:52:12] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2648683 (10Marostegui) I have renamed the tables on these hosts today ``` dbstore1001.eqiad.wmnet dbstore1002.eqiad.wmnet labsdb1001.eqiad.wmnet labsdb1003.eqiad.wmnet db1069.eqiad.wmn... [17:00:50] as I predicted, labsdb1004's error is "Error 'Index column size too large. The maximum column size is 767 bytes.' on query." [17:13:42] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2649157 (10jcrespo) Note that by dropping hitcounter from labs hosts but keeping it on db1069 and restarting the server (which was lagged), replication broke: ``` MariaDB labsdb1001 (... [18:37:20] Having load on db1026 is awry :/ [18:37:35] that host is so much slower than the other s5 dbs [18:38:26] it is 1 load [18:38:38] Yes, but it still gets picked up occasionally [18:38:40] if it had 0 [18:38:45] and one of our dumpers picked it up [18:38:45] it would lag behind [18:38:56] dumpers?
[18:38:58] meaning dumping that shard takes at least 5h longer than the others [18:39:06] For the dumps at https://dumps.wikimedia.org/wikidatawiki/entities/ [18:39:08] why do dumps use main servers? [18:39:12] that is a bug in your code [18:39:22] Because of architecture foobar [18:39:25] yes [18:39:42] because you violate the contract [18:39:48] because you want faster servers [18:39:56] so not you are paying for it [18:40:03] *now [18:40:15] bad decision == bad consequences [18:40:32] web requests also end up there [18:40:41] only 1/1000 [18:41:05] the difference is 20 ms vs. 30 ms on normal requests [18:41:12] not noticeable [18:41:39] if you send dumps to that server, you will notice it [18:41:43] slowing it down [18:41:49] again, a bad idea [18:42:11] so your dumps are slowing user requests [18:43:01] fix your code, and then I will give you faster servers [18:43:59] we are about to purchase s8 [19:10:01] I wish I could… [19:11:53] :-D [19:12:01] tell someone to fix it for you! [19:12:11] it works for me :-) [19:12:49] it is literally changing mediawiki's slave selection to 'dumps', instead of regular servers [19:13:19] Yeah, but given how Wikibase services are wired, that's a little more work [19:13:23] I put it on my todo list [19:13:34] yes, we talked about this [19:13:37] in the past [19:14:00] We have many storage access service classes… and they all need to be taught about this somehow [19:14:08] but it is technical debt that has to be done at some point [19:14:19] putting there a $group [19:14:27] and being able to change it in the future [19:15:12] or with a class for wikidata [19:15:14] that does that [19:15:25] yes, I am not saying it is easy [19:19:04] db1049 is having connection issues, BTW [19:20:28] lots of "/rpc/RunJobs.php?wiki=enwiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M" [19:20:38] failing to connect to s5-master [19:21:26] looks like row contention [19:22:37] hm [19:23:06] Can you open a bug about that, if you have more details?
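The fix jynus describes — passing a $group so dump traffic goes to a dedicated replica set instead of the general pool — can be sketched roughly as follows. This is an illustration of the idea only; the names (`REPLICAS`, `pick_replica`, the host assignments) are hypothetical and not MediaWiki's actual load-balancer API:

```python
# Replicas tagged by query group: expensive dump jobs are routed to a
# host set aside for them, so they never slow down the web-serving pool.
REPLICAS = {
    "web":  ["db1045", "db1070"],   # faster hosts for user requests
    "dump": ["db1026"],             # slower host reserved for dumps
}

def pick_replica(group="web"):
    # Fall back to the web pool if a group has no dedicated hosts.
    pool = REPLICAS.get(group) or REPLICAS["web"]
    return pool[0]

print(pick_replica("dump"))   # dump traffic stays off the web pool
```

The same default-with-fallback shape is what makes the change incremental: callers that never pass a group keep their current behaviour.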
[19:23:53] I am [19:24:09] could be nothing (spike of load) [19:24:13] could be code [19:24:25] I would bet on the second based on starting at 17 UTC [19:25:34] thanks [19:40:55] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized job running for all wikis - https://phabricator.wikimedia.org/T146079#2649995 (10jcrespo) [19:41:06] hoo ^ [19:44:22] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2650013 (10jcrespo) [21:22:58] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2650437 (10Catrope) [21:26:44] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2650449 (10hoo) Since 98bd2437ae38f395a8b47e6895793e88ca3ae6b4 we use larger transactions in E... [21:26:48] * AaronSchulz reads backscroll [21:27:16] hoo: yeah, I know the DB_REPLICA patches ended up being a lot of work for you. Much appreciated :) [21:27:31] (e.g. passing the flags down a half-dozen layers of abstraction) [21:29:24] I'll look into this more later this week… not working much today + tomorrow [21:29:41] cu [21:30:14] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2649995 (10aaron) Does addUsages() get called when no other writes are pending commit? If so,... 
[23:42:43] 10DBA, 10MediaWiki-extensions-ORES, 07Performance: hidenondamaging=1 query is extremely slow on enwiki - https://phabricator.wikimedia.org/T146111#2650989 (10Catrope) [23:48:42] 10DBA, 10MediaWiki-extensions-ORES, 07Performance: hidenondamaging=1 query is extremely slow on enwiki - https://phabricator.wikimedia.org/T146111#2651010 (10Catrope) Looks like `STRAIGHT_JOIN` works around the optimizer bug: ``` mysql:research@s3-analytics-slave [enwiki]> explain SELECT /*!STRAIGHT_JOIN*/...