[05:50:51] es2015 still has the sysctl error :-(
[05:50:57] Maybe we should give it a reboot?
[05:52:24] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it suddenly went down - https://phabricator.wikimedia.org/T147769#2733733 (10Marostegui) Maybe this server still needs a reboot, as it has been having the icinga warning about not be...
[05:56:31] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733734 (10Marostegui) p:05Triage>03Normal a:03Marostegui
[07:17:34] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733752 (10Marostegui) I had a chat with Jaime yesterday about the past issues with the wildcard-based privileges and it is certainly worrying. Probably...
[07:46:13] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733815 (10Marostegui) So the tricky part is this one: ``` DEFINER=viewmaster ``` As per MySQL documentation: ``` If you specify the DEFINER clause, y...
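Editor's note: the MySQL quote above is truncated, but the point of the DEFINER clause is that a SQL SECURITY DEFINER view executes with the definer's privileges rather than the caller's. A minimal sketch of why DEFINER=viewmaster is the tricky part; the database and table names are hypothetical, only the viewmaster account comes from the log:

```
-- Hypothetical labs-style view; only DEFINER=viewmaster is from the log.
CREATE DEFINER = 'viewmaster'@'%'
    SQL SECURITY DEFINER        -- the default: the view runs with the
                                -- definer's privileges, not the caller's
    VIEW enwiki_p.page AS
    SELECT page_id, page_namespace, page_title
    FROM enwiki.page;
-- Anyone granted SELECT on enwiki_p.page reads the base table through
-- viewmaster's privileges, so viewmaster must keep SELECT on enwiki.page,
-- and dropping or renaming that account breaks every such view.
```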
[08:18:12] 10DBA: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733838 (10jcrespo) a:03jcrespo
[08:33:35] 10DBA: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733846 (10jcrespo) Applied to db1070 - the reason it was not active is that it was a former master, so it had the configuration for a normal master. Trying to detect other p...
[08:53:56] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733873 (10Marostegui) Hi, I have deployed this new index in `db2034.codfw.wmnet` S1.enwiki and this is the final table schema: ``` MariaD...
[08:57:10] 10DBA, 13Patch-For-Review: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733877 (10jcrespo) Pending hosts: * es hosts - it will not be applied * Old masters: need to change their events to slave ones * New masters: needs to change its e...
[09:04:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733886 (10Ladsgroup) It should be the same as the schema [[ https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/sql/ores_m...
[09:08:14] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733888 (10Marostegui) >>! In T147734#2733886, @Ladsgroup wrote: > It should be the same as the schema [[ https://github.com/wikimedia/med...
[09:14:40] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733898 (10Ladsgroup) Thank you! > If you need it somewhere else faster, I can also do that for you. Nah, there is no need. Thanks
[09:16:02] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733900 (10jcrespo) > Which looks the same except varchar turned into varbinary. Which I don't know is expected or not. Yes, mediawiki at W...
[09:16:38] Thanks jynus for the context :-)
[09:17:21] ^this is the reason pt-table-checksum breaks on our env
[09:17:35] for all but the latest version, where I sent a patch
[09:17:48] Ah interesting
[09:17:58] Will they release it?
[09:18:29] it is already in the repo, but I do not know if they have released it
[09:18:57] the change is very easy - just one line forcing the columns to be utf8
[09:18:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733901 (10Marostegui) Change applied to all S1 codfw hosts: ``` root@neodymium:/home/marostegui/git# for i in `cat software/dbtools/s1.host...
[09:26:56] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733934 (10Ladsgroup) Thanks for the explanation and the deployment.
[09:43:08] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733973 (10Marostegui) Deployed in S1.enwiki eqiad. So S1 is done: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3311 enwiki -e...
[09:49:15] 10DBA, 13Patch-For-Review: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733991 (10jcrespo) 05Open>03Resolved I have applied master events to all masters, eqiad and codfw, slave events to all slaves (on the core s1-s7 shards)....
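Editor's note: the "events killing long running queries" being rolled out in T148790 above are MariaDB event-scheduler watchdogs. A minimal sketch of the idea, assuming a MariaDB server with event_scheduler=ON; the event name, the 'wikiuser' account, and the 300-second threshold are illustrative, and the production events differ per role (master vs. slave), as jcrespo notes:

```
-- Illustrative watchdog event, not the production definition from T148790.
DELIMITER //
CREATE EVENT IF NOT EXISTS kill_long_running_queries
ON SCHEDULE EVERY 30 SECOND DO
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE qid BIGINT;
    -- Long-running client queries; replication and system threads are
    -- excluded by filtering on the application account.
    DECLARE cur CURSOR FOR
        SELECT id FROM information_schema.processlist
        WHERE user = 'wikiuser' AND command = 'Query' AND time > 300;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
    OPEN cur;
    kill_loop: LOOP
        FETCH cur INTO qid;
        IF done THEN LEAVE kill_loop; END IF;
        -- KILL wants a literal id, so go through dynamic SQL.
        SET @kill_sql = CONCAT('KILL QUERY ', qid);
        PREPARE stmt FROM @kill_sql;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;
```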
[09:59:32] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734014 (10Marostegui) Deployed in S5 wikidatawiki: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3315 wikidatawiki -e "show cre...
[10:16:26] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734037 (10Marostegui) S7 fawiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3317 fawiki -e "show create table or...
[10:39:16] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734049 (10Marostegui) S2 - ptwiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3312 ptwiki -e "show create table...
[12:07:09] let's talk after lunch, marostegui, about dbstore
[12:35:42] jynus: Sure - I was giving the speech :)
[12:35:45] But I am around now
[12:36:25] I am unpacking the S3 copy now on dbstore2001 to get S3 replication back up, and I am going to import S1 now because the key might be losing the session in mysql
[12:36:47] when do you end working today? I want to know if I have time to go have lunch
[12:38:22] jynus: I will end late today, around 6.30 or so
[12:38:26] Plenty of time :)
[12:57:37] 10DBA, 13Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2734243 (10Marostegui) I have restored the S3 snapshot so S3 is replicating again.
[13:01:20] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734245 (10Marostegui) S2 - plwiki is deployed ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KEY...
[13:08:00] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734263 (10Marostegui) S2 - nlwiki is deployed: ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KE...
[13:08:06] I see that db1070 has been depooled due to being overloaded. Please remember to add me (and aude) to tickets in case of Wikidata-related problems
[13:08:18] Just in case it's a software thing on our end
[13:09:42] hoo: It is not depooled, at least I do not see it depooled in db-eqiad.php
[13:10:35] Oh, it seems that's only in gerrit then
[13:10:54] https://gerrit.wikimedia.org/r/317009
[13:11:30] Anyway, I just wanted to encourage you to poke us in case of Wikidata problems
[13:11:53] hoo: Yeah, it was going to be depooled last night when we had the issue, but in the end we didn't do it
[13:11:54] There are also fixes in Wikibase which are not yet deployed :/
[13:12:06] Will do, thanks for the help :)
[13:16:19] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734269 (10Marostegui) S6 - ruwiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3316 ruwiki -e "show create table...
[13:23:09] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734271 (10Marostegui) S2 - trwiki is deployed: ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KE...
[13:23:40] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734272 (10Marostegui) @Ladsgroup I believe all of them are done, let me know if this can be closed. Cheers Manuel.
[13:28:31] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2702076 (10Ladsgroup) 05Open>03Resolved
[13:34:47] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734302 (10Ladsgroup) Thanks!
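Editor's note: the schema change just closed above can be reconstructed from the KEY definition quoted repeatedly in the deployment output. A sketch, assuming change 309825 boils down to this index addition; the ALTER form is inferred from that output, not the literal change:

```
-- Index as quoted in the log output for ores_model; the ALTER statement
-- itself is an assumption reconstructed from that output.
ALTER TABLE ores_model
    ADD KEY ores_model_status (oresm_name, oresm_is_current);

-- Per-host verification, as in the neodymium loops above:
SHOW CREATE TABLE ores_model\G
```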
[14:34:19] was this your schema change? or do you know what it could be? https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=1477041778432&to=1477042689220&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-group=parsercache&var-shard=All&var-role=All
[14:35:13] 350 million rows written per second is a lot
[14:35:14] checking
[14:35:59] mmmm
[14:36:00] maybe it was me with the events?
[14:36:04] The table was really small
[14:36:12] The alters were taking 1 sec
[14:36:15] but I did not write so many rows either
[14:36:43] It does match the shards I was touching
[14:37:14] no, you touched only 3-4, right?
[14:37:39] this is s1-s7 but not x1
[14:37:48] No, S1, S2, S5, S6, S7
[14:38:30] But the table is tiny, less than 10 rows
[14:38:37] no, clearly it is not that
[14:39:07] I will continue investigating, it is a lot of rows written
[14:39:41] I can help with the investigation if you like
[14:40:58] well, any idea where to start?
[14:41:09] I am looking at individual servers
[14:41:15] and do not see such a pattern
[14:43:50] Yeah, I was checking that too :|
[14:44:11] There is not even a spike in traffic
[14:44:54] this is the only relevant SAL: 09:13 jynus: reviewing and applying new watchdog events to all core dbs T148790
[14:44:54] T148790: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790
[14:45:21] • 09:35 marostegui: Deploying schema change S1 enwiki.ores_model in eqiad - T147734
[14:45:22] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734
[14:45:32] nope, it starts before that
[14:45:48] 9:23
[14:46:00] true
[14:46:07] Did you touch those specific shards only?
[14:46:35] s2 starts at around 9:26 too
[14:46:42] no, I touched x1, too
[14:46:52] but I touched all of them, s1-s7
[14:47:23] I am going to check the binary logs
[14:47:31] I am checking the s2 ones
[14:50:21] lots of moduledeps calls
[14:50:25] but nothing abnormal
[14:50:39] and size-wise, the logs look normal
[14:50:55] Yes, and there are not many logs generated today
[14:50:58] a normal amount of logs
[14:50:58] if there were actual writes, they would have grown enormously
[14:51:15] could it be some kind of metric problem?
[14:52:01] the check is handler=~"(write|update|delete)
[14:53:31] the master has written less than 8000M rows since it restarted
[14:53:54] this is probably a single server resetting its stats
[14:54:05] overflowing or something
[14:54:26] I do not see anything weird on the S2 master either
[14:54:28] returning 0 and then a lot of rows
[14:58:46] Is it normal to have the module_deps replace query at the same time in all the shards?
[15:06:00] depending on the definition of normal
[15:06:05] normal-wmf, yes
[15:06:11] :-)
[15:06:30] I've gone over the 92 servers in eqiad
[15:06:34] and it is not one of them
[15:06:39] it must be the sum, then
[15:07:54] or some kind of prometheus restart
[15:08:17] Maybe ask filippo?
[15:08:21] codfw has the same profile
[15:08:28] but 20 minutes later
[15:08:36] that looks like prometheus, not mysql
[15:08:38] yes
[15:09:02] ^godog
[15:09:44] could it also be grafana, as in, the expression is not correct?
[15:10:26] but it is so perfectly aligned that it cannot be puppet :-(
[15:16:05] reading up on the scrollback, looks like the rows written on s6 are out of whack?
[15:16:12] in codfw, that is
[15:16:14] no, eqiad
[15:16:36] if I change irate to rate, things do not go so crazy
[15:20:16] mhh that's true, also only the update handler seems to go crazy
[15:29:03] ah yeah, I see it now in the raw data, e.g. db1030 update handler count went down and then back up
[15:30:20] rate / irate operate on counters that should only increase or reset - do you know what could have happened today at around 9:32 on db1030?
[15:38:22] godog, I was asking the same question :-)
[15:39:00] the mysql service did not update
[15:39:13] and I assume you did not restart the service or anything?
[15:39:23] the prometheus-mysql-exporter?
[15:40:21] no, though restarting the exporter shouldn't affect the metrics or decrease them
[15:40:39] but it is not a single server
[15:40:43] all of them, in order of shard
[15:40:56] all shards
[15:41:09] if it was one server, just one glitch
[15:41:18] but all?
[15:46:21] yeah it is odd, what could affect mysql counters like that? http://imgur.com/a/kwzF5
[15:51:38] only a restart
[15:51:43] which didn't happen
[15:52:15] ah, a restart should reset the counters to zero though?
[16:12:15] yes, a mysql restart resets SHOW GLOBAL STATUS to 0
[16:12:19] but that is ok
[16:12:51] maybe I should not use irate?
[16:15:41] irate I think is fine to use, it deals with counter resets back to zero; apparently rate does better in cases like the above
[16:18:50] I'm assuming that the data polled by prometheus is correct, i.e. that the counter for the update handler effectively went down
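Editor's note: the irate-versus-rate difference discussed above can be made concrete. A sketch of the two aggregations, assuming the mysqld exporter's handler counter is named mysql_global_status_handlers_total; only the handler selector is quoted from the dashboard expression in the log:

```
# irate() takes the per-second rate from the last two samples only, so a
# counter that dips and recovers is treated as a reset and produces one
# enormous spike (the "350 million rows/s").
sum(irate(mysql_global_status_handlers_total{handler=~"(write|update|delete)"}[5m]))

# rate() averages over the whole window, smoothing a single bad sample.
sum(rate(mysql_global_status_handlers_total{handler=~"(write|update|delete)"}[5m]))
```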
[16:54:05] 10DBA, 13Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2734744 (10Marostegui) I have started a new copy of s1 from db2055, copying also the `cfg` files to dbstore2001. Also, I have added a small text to the documentation stating that we are not sure whether the me...
[16:58:52] but "db1030 update handler count went down and then back up" is not possible under normal circumstances
[16:59:00] unless that server crashed
[16:59:09] even if it crashed, they would be reset to 0
[16:59:15] not to the previous state
[16:59:26] I am not saying it is prometheus
[16:59:38] but it could be a bug in the exporter or the dbs themselves
[16:59:57] or alternatively, something very wrong happened
[17:25:38] jynus: indeed, also the rate itself changed significantly after it freaked out
[17:26:21] it is not a big issue
[17:26:36] but imagine the surprise when I saw 250 million writes a second!
[17:26:38] :-)
[17:26:43] it can wait
[17:27:13] I still have to do some extra fixes for everything prometheus
[17:27:29] hehehe we'd need some more machines for 250M writes/s
[17:28:52] I'm off, enjoy the weekend!
[20:36:24] 10DBA, 10MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2735236 (10Legoktm)
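Editor's footnote on the counter semantics discussed at 16:58-16:59 above: SHOW GLOBAL STATUS handler counters only grow while mysqld runs, and a restart or crash resets them to 0, never to an earlier, lower value, which is why a dip-and-recover pattern points at the metrics pipeline rather than the server. A quick way to check this on a host:

```
-- Handler_* counters are cumulative since startup; compare with Uptime to
-- spot a reset. A value going down while Uptime keeps growing should not
-- happen on a healthy server.
SHOW GLOBAL STATUS LIKE 'Handler_update';
SHOW GLOBAL STATUS LIKE 'Handler_write';
SHOW GLOBAL STATUS LIKE 'Uptime';
```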