[06:44:44] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647450 (10Marostegui) The sysctl settings error looks gone now, and I can read them actually: ``` root@db1082:/proc/sys/net# sysctl -a | wc -l 1702 ``` The offset error looks weird: ``` root@db1082:/proc/sys/net# nt... [07:20:03] 10DBA, 06Operations: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2647480 (10Marostegui) I have renamed the table in the following codfw hosts: ``` db2034.codfw.wmnet db2042.codfw.wmnet db2048.codfw.wmnet db2055.codfw.wmnet db2062.codfw.wmnet db2069.codfw... [08:23:22] 10DBA, 06Operations: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2647521 (10Marostegui) I have renamed the table in eqiad hosts (the already exists errors are because those hosts were used as canary: S1 - enwiki: ``` root@neodymium:/home/mar... [08:33:41] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647554 (10Marostegui) NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine [08:45:40] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647587 (10Marostegui) a:03Marostegui [08:51:59] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647640 (10Marostegui) As we discussed, we will also test InnoDB compression (T139055) [09:30:18] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647735 (10Marostegui) db1070 - 1.1TB ibdata file out of 1.3T used db1082 - 1.1TB ibdata file out of 1.6T used (see T145533 as this server is possibly right now in a weird state) db1087 - 1.1TB ibdata file out of 1.3T... [09:36:46] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647755 (10jcrespo) >>! 
In T145533#2647554, @Marostegui wrote: > NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine In fact, I would reboot it several times to see if it happens again... [09:38:20] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647757 (10Marostegui) Makes sense, thanks for giving me context on past issues. I will do that a few times, and by the end of the day I will give it another final reboot and leave it like that for a few days, just... [10:00:14] 10DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2647769 (10jcrespo) By experience, I would recommend against mysqldump. I would suggest mydumper with the slave stopped. Or better, probably, stop the slave, perform with some parallelism ENGINE=INNODB, FORCE in a serv... [10:24:06] powercycling db1061 [10:24:55] <_joe_> I have a dumb SQL question [10:25:03] <_joe_> say I want to do [10:25:21] postgres? [10:25:26] <_joe_> no, mysql [10:25:26] go on [10:25:27] <_joe_> :) [10:25:28] oh [10:25:38] <_joe_> select * from mytable order by field1 ASC [10:25:45] <_joe_> I get in the results [10:25:52] <_joe_> thumbor.svc.eqiad.wmnet [10:25:59] <_joe_> before thumbor1001.eqiad.wmnet [10:26:05] ok [10:26:11] you want different ordering [10:26:11] <_joe_> so '.' takes precedence over '1' or any letter [10:26:17] <_joe_> same as '_' [10:26:32] by default you get the ordering from the config or the one intrinsic to the table [10:26:37] <_joe_> well I'd like python's sort() and mysql ORDER BY to use the same ordering [10:26:40] you should check that [10:26:44] <_joe_> it's due to the collation, right?
[10:26:53] but you can change it, if you want [10:27:02] <_joe_> yes, let me check then [10:27:07] with, I think, the COLLATE keyword [10:27:32] so you may want utf8mb4 (UTF-8), binary, ASCII [10:27:55] <_joe_> yeah I'll run some tests [10:28:18] if you tell me where that is, I can help you faster [10:28:32] and it is not a dumb question at all [10:29:04] I recently asked for mediawiki's dbslist to be ordered using unicode collation (unix sort) instead of binary [10:29:40] as a tip, you can execute sort by default (utf8) or with LANG=C to see the differences [10:30:35] <_joe_> latin1_swedish_ci [10:30:36] <_joe_> sigh [10:30:40] this is the relevant bit if done at query time: http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html [10:30:53] so, I would suggest fixing the config/table [10:30:58] <_joe_> why is that db using that collation [10:30:59] instead of doing the above [10:31:03] <_joe_> :P [10:31:09] <_joe_> yeah I was about to suggest the same [10:31:18] well, if you tell me which db it is, I can tell you [10:31:40] <_joe_> puppet [10:31:49] <_joe_> what else [10:31:50] let me see [10:32:02] <_joe_> I was looking at the resources and param_names tables [10:32:44] or you can write your own collation... http://dev.mysql.com/doc/refman/5.6/en/adding-collation-simple-8bit.html [10:32:59] <_joe_> volans: shush!
this is serious business :P [10:33:05] :-P [10:33:20] <_joe_> I'm pretty sure that anything other than latin1_swedish_ci would be ok [10:33:24] <_joe_> even ASCII [10:34:11] _joe_, the instance is well configured [10:34:15] <_joe_> jynus: I'll try the queries with different collations and see which one is ok [10:34:21] if you see SHOW VARIABLES like 'char%'; [10:34:30] <_joe_> jynus: I am pretty sure that db has passed over from db to db for ages [10:34:32] it is the database creation and thus all tables [10:34:36] <_joe_> db host to db host [10:34:42] that were created wrongly [10:34:50] character_set_database | latin1 [10:35:06] <_joe_> yes, so let me try just to do COLLATE utf8mb4 [10:35:07] CREATE DATABASE `puppet` /*!40100 DEFAULT CHARACTER SET latin1 */ [10:35:30] so, I do not know if it makes sense to fix something that is going to be deleted [10:35:39] <_joe_> my point [10:35:43] but you can just use the link I sent you [10:35:52] to get any ordering [10:36:21] <_joe_> jynus: yes, I should've looked at the db first, then asked questions :P [10:36:21] "ORDER BY X COLLATE Y" [10:36:31] <_joe_> jynus: yes, I even know the syntax [10:36:43] <_joe_> I just dumbly assumed it was utf-8 by default everywhere [10:36:54] "SHOW COLLATION;" [10:37:00] will show you all available ones [10:37:18] it would be difficult not to find one you want [10:37:28] note that we are a bit special [10:37:36] because mediawiki mainly uses binary collation [10:37:46] we end up doing a lot of things in binary [10:38:10] but that is just the default, no reason to do it if it is not desired [10:52:08] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647837 (10jcrespo) Just one thing, rebooting would be a great way to test https://gerrit.wikimedia.org/r/#/c/310564/ In fact, I am going to test it on db1061 now, too.
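The ordering puzzle above can be reproduced outside MySQL. A minimal Python sketch (hostnames taken from the conversation; the case-folding key is only a rough analogue of what `_ci` collations do for plain ASCII letters):

```python
# '.' is byte 0x2E, which sorts before '1' (0x31) and all letters, so
# any bytewise (binary/ASCII) comparison puts "thumbor.svc..." first --
# the same result _joe_ saw from ORDER BY.
hosts = ["thumbor1001.eqiad.wmnet", "thumbor.svc.eqiad.wmnet"]
print(sorted(hosts))  # thumbor.svc.eqiad.wmnet comes first

# Case-insensitive (_ci) collations compare case-folded text instead,
# roughly what a key function gives you in Python:
print(sorted(["b", "A", "a"], key=str.lower))
```

Running `sort` with `LANG=C`, as suggested above, gives the same bytewise order as the first call.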
[10:54:32] <_joe_> select * from param_names ORDER BY name COLLATE utf8_bin; [10:54:32] <_joe_> ERROR 1253 (42000): COLLATION 'utf8_bin' is not valid for CHARACTER SET 'latin1' [10:54:35] <_joe_> uhm [10:57:10] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (10jcrespo) [10:57:22] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647853 (10jcrespo) [10:57:28] _joe_: see https://dev.mysql.com/doc/refman/5.7/en/charset-binary-collations.html [10:57:55] _joe_, if data is in latin1, only latin1 collation can be used [10:58:11] <_joe_> jynus: which don't do what I want [10:58:23] <_joe_> but let me run some tests [10:58:32] pager grep latin1 [10:58:36] SHOW COLLATION; [10:58:43] pager less [10:59:12] then you must convert the table to another collation, latin1 is limited for a reason :-) [10:59:45] <_joe_> jynus: I created a copy of param_names, which has 43 rows, to run some tests [10:59:52] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647854 (10Marostegui) Sounds good - I have rebooted it twice already and expect to do a few more before the end of the day. [10:59:59] <_joe_> I'll drop the new table as soon as I am done [11:00:03] ok [11:00:25] let me know if I can help, I know you can do it [11:00:34] it is just that we can do it faster [11:00:41] :-) [11:02:33] <_joe_> jynus: I would need to use ascii [11:02:36] <_joe_> as a charset [11:02:47] <_joe_> because of course, python 2.x treats strings as ascii [11:02:57] <_joe_> so it's python's fault, not mysql's [11:03:10] <_joe_> let me try to fix the python code first [11:03:14] oh! 
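The ERROR 1253 above happens because a collation is tied to a specific character set: `utf8_bin` cannot be applied directly to `latin1` data, so the table has to be converted first. A rough Python analogue of that conversion step (the sample bytes are illustrative, not from the puppet db):

```python
# Collations operate on decoded text, not raw bytes: to sort latin1
# data under a different rule, decode it to unicode first -- the same
# reason the latin1 table must be converted before utf8_bin applies.
raw = [b"caf\xe9", b"cafe"]                  # latin1 bytes, 0xE9 = 'é'
decoded = [b.decode("latin-1") for b in raw]
print(sorted(decoded))                       # codepoint order: cafe, café
```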
[11:03:31] yes, make sure it is using unicode strings [11:03:47] (sorry, I always think it is a mysql problem) [11:04:03] the whole "u''" [11:11:22] <_joe_> jynus: nah I just mixed up orderings there [11:12:11] the database wasn't right, either [11:26:15] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647878 (10jcrespo) a:03jcrespo [11:26:23] 10DBA, 06Operations, 10ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (10jcrespo) p:05Triage>03Normal [11:33:11] interesting, I think db1061 did not page [11:33:17] which is good [11:33:29] but I do not know why [11:34:09] probably because hosts do not do that, and services depend on the host [11:34:39] if that is the case, that is the intended behaviour [11:58:53] jynus: it didn't page because the host check contact_groups is admins (doesn't have sms) and the services ones with sms I think are dependent on the host itself [11:58:58] [1474247594] HOST NOTIFICATION: irc;db1061;DOWN;notify-host-by-irc;PING CRITICAL - Packet loss = 100% [12:10:21] which is good [12:26:40] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647949 (10Marostegui) Every time the server gets restarted, NTP alerts until I run the ntp sync manually. I have rebooted again to see how it comes back and what happens if I do not touch it. [12:32:32] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647953 (10MoritzMuehlenhoff) How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart of NTP. [12:32:34] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647954 (10jcrespo) @Marostegui NTP being off for some minutes is "normal" (known limitation with low priority). What was an issue/strange was it being off for hours/days.
[12:33:47] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647961 (10jcrespo) @MoritzMuehlenhoff see above: >>! In T145533#2645185, @jcrespo wrote: >> Check size of conntrack table >> >> Notifications for this service have been disabled >> WARNING 2016-09-17 03:19:29 1d 13... [12:34:55] 10DBA, 06Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647964 (10Marostegui) >>! In T145533#2647953, @MoritzMuehlenhoff wrote: > How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart... [13:19:52] 10DBA, 10ChangeProp, 10MediaWiki-API, 10MediaWiki-Database, and 7 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619222 (10hashar) Deployed on current wmf.18 as well as next wmf.19. [15:52:12] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2648683 (10Marostegui) I have renamed the tables on these hosts today ``` dbstore1001.eqiad.wmnet dbstore1002.eqiad.wmnet labsdb1001.eqiad.wmnet labsdb1003.eqiad.wmnet db1069.eqiad.wmn... [17:00:50] as I predicted, labsdb1004's error is "Error 'Index column size too large. The maximum column size is 767 bytes.' on query." [17:13:42] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2649157 (10jcrespo) Note that by dropping hitcounter from labs hosts but keeping it on db1069 and restarting the server (which was lagged), replication broke: ``` MariaDB labsdb1001 (... [18:37:20] Having load on db1026 is awry :/ [18:37:35] that host is so much slower than the other s5 dbs [18:38:26] it is 1 load [18:38:38] Yes, but it still gets picked up occasionally [18:38:40] if it had 0 [18:38:45] and one of our dumpers picked it up [18:38:45] it would lag behind [18:38:56] dumpers?
[18:38:58] meaning dumping that shard takes at least 5h longer than the others [18:39:06] For the dumps at https://dumps.wikimedia.org/wikidatawiki/entities/ [18:39:08] why do dumps use main servers? [18:39:12] that is a bug in your code [18:39:22] Because of architecture foobar [18:39:25] yes [18:39:42] because you violate the contract [18:39:48] because you want faster servers [18:39:56] so not you are paying for it [18:40:03] *now [18:40:15] bad decision == bad consequences [18:40:32] web requests also end up there [18:40:41] only 1/1000 [18:41:05] the difference is 20 ms vs. 30 ms on normal requests [18:41:12] not noticeable [18:41:39] if you send dumps to that server, you will notice it [18:41:43] slowing it down [18:41:49] again, a bad idea [18:42:11] so your dumps are slowing user requests [18:43:01] fix your code, and then I will give you faster servers [18:43:59] we are about to purchase s8 [19:10:01] I wish I could… [19:11:53] :-D [19:12:01] tell someone to fix it for you! [19:12:11] it works for me :-) [19:12:49] it is literally changing mediawiki's slave selection to 'dumps', instead of regular servers [19:13:19] Yeah, but given how Wikibase services are wired, that's a little more work [19:13:23] I put it on my todo list [19:13:34] yes, we talked about this [19:13:37] in the past [19:14:00] We have many storage access service classes… and they all need to be taught about this somehow [19:14:08] but it is technical debt that has to be done at some point [19:14:19] putting there a $group [19:14:27] and being able to change it in the future [19:15:12] or with a class for wikidata [19:15:14] that does that [19:15:25] yes, I am not saying it is easy [19:19:04] db1049 is having connection issues, BTW [19:20:28] lots of "/rpc/RunJobs.php?wiki=enwiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M" [19:20:38] failing to connect to s5-master [19:21:26] looks like row contention [19:22:37] hm [19:23:06] Can you open a bug about that, if you have more details?
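The fix jynus describes — passing a $group so dump traffic goes to a dedicated replica set instead of the general pool — can be sketched roughly as follows. This is an illustration of the idea only; the names (`REPLICAS`, `pick_replica`, the host assignments) are hypothetical and not MediaWiki's actual load-balancer API:

```python
# Replicas tagged by query group: expensive dump jobs are routed to a
# host set aside for them, so they never slow down the web-serving pool.
REPLICAS = {
    "web":  ["db1045", "db1070"],   # faster hosts for user requests
    "dump": ["db1026"],             # slower host reserved for dumps
}

def pick_replica(group="web"):
    # Fall back to the web pool if a group has no dedicated hosts.
    pool = REPLICAS.get(group) or REPLICAS["web"]
    return pool[0]

print(pick_replica("dump"))   # dump traffic stays off the web pool
```

The same default-with-fallback shape is what makes the change incremental: callers that never pass a group keep their current behaviour.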
[19:23:53] I am [19:24:09] could be nothing (spike of load) [19:24:13] could be code [19:24:25] I would bet on the second based on starting at 17 UTC [19:25:34] thanks [19:40:55] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized job running for all wikis - https://phabricator.wikimedia.org/T146079#2649995 (10jcrespo) [19:41:06] hoo ^ [19:44:22] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2650013 (10jcrespo) [21:22:58] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2650437 (10Catrope) [21:26:44] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2650449 (10hoo) Since 98bd2437ae38f395a8b47e6895793e88ca3ae6b4 we use larger transactions in E... [21:26:48] * AaronSchulz reads backscroll [21:27:16] hoo: yeah, I know the DB_REPLICA patches ended up being a lot of work for you. Much appreciated :) [21:27:31] (e.g. passing the flags down a half-dozen layers of abstraction) [21:29:24] I'll look into this more later this week… not working much today + tomorrow [21:29:41] cu [21:30:14] 10DBA, 10MediaWiki-JobRunner, 10Wikidata, 07Wikimedia-log-errors: s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis - https://phabricator.wikimedia.org/T146079#2649995 (10aaron) Does addUsages() get called when no other writes are pending commit? If so,... 
[23:42:43] 10DBA, 10MediaWiki-extensions-ORES, 07Performance: hidenondamaging=1 query is extremely slow on enwiki - https://phabricator.wikimedia.org/T146111#2650989 (10Catrope) [23:48:42] 10DBA, 10MediaWiki-extensions-ORES, 07Performance: hidenondamaging=1 query is extremely slow on enwiki - https://phabricator.wikimedia.org/T146111#2651010 (10Catrope) Looks like `STRAIGHT_JOIN` works around the optimizer bug: ``` mysql:research@s3-analytics-slave [enwiki]> explain SELECT /*!STRAIGHT_JOIN*/...