[05:50:51] es2015 still has the sysctl error :-(
[05:50:57] Maybe we should give it a reboot?
[05:52:24] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it suddenly went down - https://phabricator.wikimedia.org/T147769#2733733 (10Marostegui) Maybe this server still needs a reboot, as it has been having the icinga warning about not be...
[05:56:31] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733734 (10Marostegui) p:05Triage>03Normal a:03Marostegui
[07:17:34] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733752 (10Marostegui) I had a chat with Jaime yesterday about the past issues with the wildcard-based privileges and it is certainly worrying. Probably...
[07:46:13] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733815 (10Marostegui) So the tricky part is this one: ``` DEFINER=viewmaster ``` As per MySQL documentation: ``` If you specify the DEFINER clause, y...
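Editor's note: the MySQL quote above is truncated, but the point of the DEFINER clause is that a SQL SECURITY DEFINER view executes with the definer's privileges rather than the caller's. A minimal sketch of why DEFINER=viewmaster is the tricky part; the database and table names are hypothetical, only the viewmaster account comes from the log:

```
-- Hypothetical labs-style view; only DEFINER=viewmaster is from the log.
CREATE DEFINER = 'viewmaster'@'%'
    SQL SECURITY DEFINER        -- the default: the view runs with the
                                -- definer's privileges, not the caller's
    VIEW enwiki_p.page AS
    SELECT page_id, page_namespace, page_title
    FROM enwiki.page;
-- Anyone granted SELECT on enwiki_p.page reads the base table through
-- viewmaster's privileges, so viewmaster must keep SELECT on enwiki.page,
-- and dropping or renaming that account breaks every such view.
```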
[08:18:12] 10DBA: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733838 (10jcrespo) a:03jcrespo
[08:33:35] 10DBA: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733846 (10jcrespo) Applied to db1070 - the reason it was not active is that it was a former master, so it had the configuration for a normal master. Trying to detect other p...
[08:53:56] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733873 (10Marostegui) Hi, I have deployed this new index in `db2034.codfw.wmnet` S1.enwiki and this is the final table schema: ``` MariaD...
[08:57:10] 10DBA, 13Patch-For-Review: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733877 (10jcrespo) Pending hosts: * es hosts - it will not be applied * Old masters: need to change their events to slave ones * New masters: needs to change its e...
[09:04:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733886 (10Ladsgroup) It should be the same as the schema [[ https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/sql/ores_m...
[09:08:14] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733888 (10Marostegui) >>! In T147734#2733886, @Ladsgroup wrote: > It should be the same as the schema [[ https://github.com/wikimedia/med...
[09:14:40] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733898 (10Ladsgroup) Thank you! > If you need it somewhere else faster, I can also do that for you. Nah, there is no need. Thanks
[09:16:02] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733900 (10jcrespo) > Which looks the same except varchar turned into varbinary. Which I don't know is expected or not. Yes, mediawiki at W...
[09:16:38] Thanks jynus for the context :-)
[09:17:21] ^this is the reason pt-table-checksum breaks on our env
[09:17:35] for all but the latest version, where I sent a patch
[09:17:48] Ah interesting
[09:17:58] Will they release it?
[09:18:29] it is already in the repo, but I do not know if they have released it
[09:18:57] the change is very easy - just one line forcing the columns to be utf8
[09:18:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733901 (10Marostegui) Change applied to all S1 codfw hosts: ``` root@neodymium:/home/marostegui/git# for i in `cat software/dbtools/s1.host...
[09:26:56] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733934 (10Ladsgroup) Thanks for the explanation and the deployment.
[09:43:08] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2733973 (10Marostegui) Deployed in S1.enwiki eqiad. So S1 is done: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3311 enwiki -e...
[09:49:15] 10DBA, 13Patch-For-Review: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790#2733991 (10jcrespo) 05Open>03Resolved I have applied master events to all masters, eqiad and codfw, slave events to all slaves (on the core s1-s7 shards)....
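Editor's note: the "events killing long running queries" being rolled out in T148790 above are MariaDB event-scheduler watchdogs. A minimal sketch of the idea, assuming a MariaDB server with event_scheduler=ON; the event name, the 'wikiuser' account, and the 300-second threshold are illustrative, and the production events differ per role (master vs. slave), as jcrespo notes:

```
-- Illustrative watchdog event, not the production definition from T148790.
DELIMITER //
CREATE EVENT IF NOT EXISTS kill_long_running_queries
ON SCHEDULE EVERY 30 SECOND DO
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE qid BIGINT;
    -- Long-running client queries; replication and system threads are
    -- excluded by filtering on the application account.
    DECLARE cur CURSOR FOR
        SELECT id FROM information_schema.processlist
        WHERE user = 'wikiuser' AND command = 'Query' AND time > 300;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
    OPEN cur;
    kill_loop: LOOP
        FETCH cur INTO qid;
        IF done THEN LEAVE kill_loop; END IF;
        -- KILL wants a literal id, so go through dynamic SQL.
        SET @kill_sql = CONCAT('KILL QUERY ', qid);
        PREPARE stmt FROM @kill_sql;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;
```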
[09:59:32] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734014 (10Marostegui) Deployed in S5 wikidatawiki: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3315 wikidatawiki -e "show cre...
[10:16:26] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734037 (10Marostegui) S7 fawiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3317 fawiki -e "show create table or...
[10:39:16] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734049 (10Marostegui) S2 - ptwiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3312 ptwiki -e "show create table...
[12:07:09] let's talk after lunch, marostegui, about dbstore
[12:35:42] jynus: Sure - I was giving the speech :)
[12:35:45] But I am around now
[12:36:25] I am unpacking the S3 copy now on dbstore2001 to get S3 replication back up, and I am going to import S1 now because the key might be losing the session in mysql
[12:36:47] when do you end working today? I want to know if I have time to go have lunch
[12:38:22] jynus: I will end late today, around 6.30 or so
[12:38:26] Plenty of time :)
[12:57:37] 10DBA, 13Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2734243 (10Marostegui) I have restored the S3 snapshot so S3 is replicating again.
[13:01:20] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734245 (10Marostegui) S2 - plwiki is deployed ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KEY...
[13:08:00] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734263 (10Marostegui) S2 - nlwiki is deployed: ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KE...
[13:08:06] I see that db1070 has been depooled due to being overloaded. Please remember to add me (and aude) to tickets in case of Wikidata-related problems
[13:08:18] Just in case it's a software thing on our end
[13:09:42] hoo: It is not depooled, at least I do not see it depooled in db-eqiad.php
[13:10:35] Oh, it seems that's only in gerrit then
[13:10:54] https://gerrit.wikimedia.org/r/317009
[13:11:30] Anyway, I just wanted to encourage you to poke us in case of Wikidata problems
[13:11:53] hoo: Yeah, it was going to be depooled last night when we had the issue, but in the end we didn't do it
[13:11:54] There are also fixes in Wikibase which are not yet deployed :/
[13:12:06] Will do, thanks for the help :)
[13:16:19] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734269 (10Marostegui) S6 - ruwiki is deployed: ``` root@neodymium:/home/marostegui/git# mysql -hdb1069 -P3316 ruwiki -e "show create table...
[13:23:09] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734271 (10Marostegui) S2 - trwiki is deployed: ``` KEY `ores_model_status` (`oresm_name`,`oresm_is_current`) dbstore2002.codfw.wmnet KE...
[13:23:40] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734272 (10Marostegui) @Ladsgroup I believe all of them are done, let me know if this can be closed. Cheers Manuel.
[13:28:31] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2702076 (10Ladsgroup) 05Open>03Resolved
[13:34:47] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734#2734302 (10Ladsgroup) Thanks!
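Editor's note: the schema change just closed above can be reconstructed from the KEY definition quoted repeatedly in the deployment output. A sketch, assuming change 309825 boils down to this index addition; the ALTER form is inferred from that output, not the literal change:

```
-- Index as quoted in the log output for ores_model; the ALTER statement
-- itself is an assumption reconstructed from that output.
ALTER TABLE ores_model
    ADD KEY ores_model_status (oresm_name, oresm_is_current);

-- Per-host verification, as in the neodymium loops above:
SHOW CREATE TABLE ores_model\G
```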
[14:34:19] was this your schema change? or do you know what it could be? https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=1477041778432&to=1477042689220&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-group=parsercache&var-shard=All&var-role=All
[14:35:13] 350 million rows written per second is a lot
[14:35:14] checking
[14:35:59] mmmm
[14:36:00] maybe it was me with the events?
[14:36:04] The table was really small
[14:36:12] The alters were taking 1 sec
[14:36:15] but I did not write so many rows either
[14:36:43] It does match the shards I was touching
[14:37:14] no, you touched only 3-4, right?
[14:37:39] this is s1-s7 but not x1
[14:37:48] No, S1, S2, S5, S6, S7
[14:38:30] But the table is tiny, less than 10 rows
[14:38:37] no, clearly it is not that
[14:39:07] I will continue investigating, it is a lot of rows written
[14:39:41] I can help with the investigation if you like
[14:40:58] well, any idea where to start?
[14:41:09] I am looking at individual servers
[14:41:15] and do not see such a pattern
[14:43:50] Yeah, I was checking that too :|
[14:44:11] There is not even a spike in traffic
[14:44:54] this is the only relevant SAL: 09:13 jynus: reviewing and applying new watchdog events to all core dbs T148790
[14:44:54] T148790: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790
[14:45:21] • 09:35 marostegui: Deploying schema change S1 enwiki.ores_model in eqiad - T147734
[14:45:22] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734
[14:45:32] nope, it starts before that
[14:45:48] 9:23
[14:46:00] true
[14:46:07] Did you touch those specific shards only?
[14:46:35] s2 starts at around 9:26 too
[14:46:42] no, I touched x1, too
[14:46:52] but I touched all of them, s1-s7
[14:47:23] I am going to check the binary logs
[14:47:31] I am checking the s2 ones
[14:50:21] lots of moduledeps calls
[14:50:25] but nothing abnormal
[14:50:39] and size-wise, the logs look normal
[14:50:55] Yes, and there are not many logs generated today
[14:50:58] a normal amount of logs
[14:50:58] if there were actual writes, they would have grown enormously
[14:51:15] could it be some kind of metric problem?
[14:52:01] the check is handler=~"(write|update|delete)
[14:53:31] the master has written less than 8000M rows since it restarted
[14:53:54] this is probably a single server resetting its stats
[14:54:05] overflowing or something
[14:54:26] I do not see anything weird on the S2 master either
[14:54:28] returning 0 and then a lot of rows
[14:58:46] Is it normal to have the module_deps replace query at the same time in all the shards?
[15:06:00] depending on the definition of normal
[15:06:05] normal-wmf, yes
[15:06:11] :-)
[15:06:30] I've gone over the 92 servers in eqiad
[15:06:34] and it is not one of them
[15:06:39] it must be the sum, then
[15:07:54] or some kind of prometheus restart
[15:08:17] Maybe ask filippo?
[15:08:21] codfw has the same profile
[15:08:28] but 20 minutes later
[15:08:36] that looks like prometheus, not mysql
[15:08:38] yes
[15:09:02] ^godog
[15:09:44] could it also be grafana, as in, the expression is not correct?
[15:10:26] but it is so perfectly aligned that it cannot be puppet :-(
[15:16:05] reading up on the scrollback, looks like the rows written on s6 are out of whack?
[15:16:12] in codfw, that is
[15:16:14] no, eqiad
[15:16:36] if I change irate to rate, things do not go so crazy
[15:20:16] mhh that's true, also only the update handler seems to go crazy
[15:29:03] ah yeah, I see it now in the raw data, e.g. db1030 update handler count went down and then back up
[15:30:20] rate / irate operate on counters that should only increase or reset - do you know what could have happened today at around 9:32 on db1030?
[15:38:22] godog, I was asking the same question :-)
[15:39:00] the mysql service did not update
[15:39:13] and I assume you did not restart the service or anything?
[15:39:23] the prometheus-mysql-exporter?
[15:40:21] no, though restarting the exporter shouldn't affect the metrics or decrease them
[15:40:39] but it is not a single server
[15:40:43] all of them, in order of shard
[15:40:56] all shards
[15:41:09] if it was one server, just one glitch
[15:41:18] but all?
[15:46:21] yeah it is odd, what could affect mysql counters like that? http://imgur.com/a/kwzF5
[15:51:38] only a restart
[15:51:43] which didn't happen
[15:52:15] ah, a restart should reset the counters to zero though?
[16:12:15] yes, a mysql restart resets SHOW GLOBAL STATUS to 0
[16:12:19] but that is ok
[16:12:51] maybe I should not use irate?
[16:15:41] irate I think is fine to use, it deals with counter resets back to zero; apparently rate does better in cases like the above
[16:18:50] I'm assuming that the data polled by prometheus is correct, i.e. that the counter for the update handler effectively went down
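Editor's note: the irate-versus-rate difference discussed above can be made concrete. A sketch of the two aggregations, assuming the mysqld exporter's handler counter is named mysql_global_status_handlers_total; only the handler selector is quoted from the dashboard expression in the log:

```
# irate() takes the per-second rate from the last two samples only, so a
# counter that dips and recovers is treated as a reset and produces one
# enormous spike (the "350 million rows/s").
sum(irate(mysql_global_status_handlers_total{handler=~"(write|update|delete)"}[5m]))

# rate() averages over the whole window, smoothing a single bad sample.
sum(rate(mysql_global_status_handlers_total{handler=~"(write|update|delete)"}[5m]))
```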
[16:54:05] 10DBA, 13Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2734744 (10Marostegui) I have started a new copy of s1 from db2055, copying also the `cfg` files to dbstore2001. Also, I have added a small text to the documentation stating that we are not sure whether the me...
[16:58:52] but "db1030 update handler count went down and then back up" is not possible under normal circumstances
[16:59:00] unless that server crashed
[16:59:09] even if it crashed, they would be reset to 0
[16:59:15] not to the previous state
[16:59:26] I am not saying it is prometheus
[16:59:38] but it could be a bug in the exporter or the dbs themselves
[16:59:57] or alternatively, something very wrong happened
[17:25:38] jynus: indeed, also the rate itself changed significantly after it freaked out
[17:26:21] it is not a big issue
[17:26:36] but imagine the surprise when I saw 250 million writes a second!
[17:26:38] :-)
[17:26:43] it can wait
[17:27:13] I still have to do some extra fixes for everything prometheus
[17:27:29] hehehe we'd need some more machines for 250M writes/s
[17:28:52] I'm off, enjoy the weekend!
[20:36:24] 10DBA, 10MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2735236 (10Legoktm)
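Editor's footnote on the counter semantics discussed at 16:58-16:59 above: SHOW GLOBAL STATUS handler counters only grow while mysqld runs, and a restart or crash resets them to 0, never to an earlier, lower value, which is why a dip-and-recover pattern points at the metrics pipeline rather than the server. A quick way to check this on a host:

```
-- Handler_* counters are cumulative since startup; compare with Uptime to
-- spot a reset. A value going down while Uptime keeps growing should not
-- happen on a healthy server.
SHOW GLOBAL STATUS LIKE 'Handler_update';
SHOW GLOBAL STATUS LIKE 'Handler_write';
SHOW GLOBAL STATUS LIKE 'Uptime';
```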