[08:33:33] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui) I have been taking a look at these indexes on enwiki, and we have two indexes in production that ar...
[08:52:09] Blocked-on-schema-change, DBA, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui)
[09:02:49] read_only: "False", expected "True" on staging
[09:03:02] yep
[09:03:04] I am aware
[09:03:20] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491408/
[09:03:24] I started taking a look yesterday
[09:04:42] maybe have it default to 1 so it doesn't need so many changes on unrelated code?
[09:05:03] yeah, but I guess we need to make that default parametrizable?
[09:05:12] ?
[09:05:21] ?
[09:05:29] staging cannot be read_only
[09:05:48] define mariadb::instance ( $read_only = 1,
[09:06:01] then you discard the changes that are not staging
[09:06:35] not sure what you mean: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491408/11/modules/profile/manifests/mariadb/dbstore_multiinstance.pp ?
[09:06:39] e.g. core/multiinstance.pp is untouched
[09:06:53] line 135
[09:06:57] dbstore_multiinstance.pp only has that line
[09:07:05] the others are not touched
[09:07:30] so you mean just committing line 135?
[09:07:44] well, and the define and the template
[09:08:45] ah I think I know what you mean
[09:09:11] :-)
[09:09:23] let me put up a patchset to see if I got you right
[09:09:24] right
[09:13:48] is this kinda what you meant? https://gerrit.wikimedia.org/r/491713
[09:14:56] yes, much simpler
[09:15:03] test it on the compiler though
[09:15:06] yeah
[09:15:08] doing it now
[09:15:17] and indeed, a lot simpler, thanks for the advice
[09:22:37] The compiler looks good! \o/
[09:22:41] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491713/
[09:42:51] ha!
[09:43:12] The problem is also that /etc/nagios/nrpe.d/check_mariadb_slave_sql_state_staging.cfg has --check_read_only=1 hardcoded
[09:45:56] ?
[09:46:34] from what I am seeing $port on modules/mariadb/manifests/monitor_readonly.pp is being done correctly, but not the $read_only
[09:46:51] check_mariadb_slave_sql_state doesn't check read only
[09:47:05] sorry: check_mariadb_read_only_staging.cfg
[09:47:12] copy&paste error :)
[09:47:30] did you run puppet there? it has to run first on the host and then on icinga
[09:47:48] I did
[09:47:55] (also on icinga)
[09:48:03] I know it is not hardcoded because we check that it is 0 on the core masters
[09:48:11] let me re-schedule the check on icinga
[09:48:24] or something else is missing
[09:49:08] it is hardcoded, or at least check_mariadb_read_only_staging.cfg didn't change it to 0
[09:49:14] oh, you forgot to change it on puppet
[09:49:29] read_only => 1,
[09:49:31] Did I?
[09:49:42] it should use the variable
[09:49:51] but I wonder why $port is being used correctly
[09:50:01] port says $port
[09:50:07] read_only says 1
[09:50:13] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491713/4/modules/mariadb/manifests/instance.pp
[09:50:18] ^line 71
[09:50:28] aaaah
[09:50:29] you reverted too much :-D
[09:50:35] I was checking modules/mariadb/manifests/monitor_readonly.pp
[09:50:38] haha indeed
[09:54:52] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491720/ now it looks good!
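For context on the flag being parametrized above, here is a minimal Python sketch, in the spirit of the check_mariadb_read_only_* NRPE checks, of a read-only check that takes the expected value as a parameter (0 on masters and staging, 1 on replicas) instead of hardcoding it. It is not the production check script: the --check_read_only spelling matches the cfg line quoted above, but the connection defaults, query and exit codes are assumptions.

    #!/usr/bin/env python3
    # Hypothetical sketch of an NRPE-style read_only check; not the production script.
    import argparse
    import sys

    import pymysql  # assumption: PyMySQL is available on the monitored host


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--host', default='localhost')
        parser.add_argument('--port', type=int, default=3306)
        parser.add_argument('--check_read_only', type=int, choices=[0, 1], required=True,
                            help='expected value of @@GLOBAL.read_only')
        args = parser.parse_args()

        try:
            conn = pymysql.connect(host=args.host, port=args.port,
                                   read_default_file='/root/.my.cnf')  # assumed credentials file
            with conn.cursor() as cursor:
                cursor.execute('SELECT @@GLOBAL.read_only')
                actual = int(cursor.fetchone()[0])
        except Exception as error:
            print(f'UNKNOWN: could not query read_only: {error}')
            sys.exit(3)

        if actual == args.check_read_only:
            print(f'OK: read_only is {actual}, as expected')
            sys.exit(0)
        print(f'CRITICAL: read_only is {actual}, expected {args.check_read_only}')
        sys.exit(2)


    if __name__ == '__main__':
        main()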
[09:55:55] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui)
[09:57:05] es backups finished
[09:57:15] thanks for the help with the read-only btw
[09:57:27] a lot simpler now
[09:57:52] how long did it take?
[09:59:00] ?
[09:59:09] the es backup
[09:59:10] what do you mean?
[09:59:23] oh, sorry, got thrown out of context
[09:59:27] haha
[09:59:38] I was like, but you did that!
[10:00:08] I think you can sort-of see it here https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-2d&to=now&var-server=es2003&var-datasource=codfw%20prometheus%2Fops&var-cluster=mysql
[10:00:08] Actually, you got me out of a loop where I was getting deeper and deeper into the change and not seeing simple things :-)
[10:00:33] so your change was maybe better, but my suggestion was simpler, which minimizes risk
[10:00:49] ah nice, so the backup took around 12h
[10:02:41] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui) db1067 (s1 master) has too much concurrency to let the alter go thru, I will try a few more times before givin...
[10:02:50] other backups seem mostly done except s8 on codfw
[10:03:34] no issues with the path then?
[10:04:24] none so far, I discovered a bug on gathering stats but I fixed it in time on Tuesday
[10:04:45] as it used to be that all backups were dumps and now there are dumps and snapshots
[10:04:47] cool
[10:04:51] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui) s3 eqiad [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1004 [] dbstore1002 [] db1124 [] db1123 [] db109...
[10:04:56] ah right!
[10:05:11] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui)
[10:05:33] should we have a specific way to differentiate those on the backups table?
[10:05:41] it is
[10:05:52] ah I see it now
[10:06:01] | 619 | snapshot.s1-test.2019-01-22--11-39-08 | ongoing | dbstore1002.eqiad.wmnet:3311 | dbstore1001.eqiad.wmnet | snapshot | s1-test | 2019-01-22 11:39:08 | NULL | NULL |
[10:06:38] s7 is also ongoing
[10:08:06] I think now I am going to finish the pieces to allow post-processing and data-gathering on a binary backup
[10:26:13] if you need help with db1067, I can have a look
[10:26:49] don't worry, I will try a few more times, it basically has too much concurrency
[10:49:56] 2 correctable memory errors on x1-master (to keep an eye on)
[10:51:01] https://phabricator.wikimedia.org/T201133
[10:51:23] I see
[10:57:03] DBA, Operations, ops-eqiad: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (Marostegui) Just for the record ` db1069 Memory correctable errors -EDAC- WARNING 2019-02-20 10:45:24 2d 19h 28m 54s 3/3 2 ge 2 `
[11:54:26] DBA, Wikimedia-Site-requests, Serbian-Sites, Wikimedia-maintenance-script-run: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 (Zoranzoki21) >>! In T212346#4957919, @MarcoAurelio wrote: > Could I have my botflag at sr.wikinews removed, please? Until T122705...
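The row pasted above shows that the backups metadata table already differentiates the two kinds of backup via a type column (dump vs snapshot). Here is a small Python sketch of how ongoing backups could be listed per type; the database, table and column names below are guesses based on that pasted row, not the real schema used by the backup tooling.

    #!/usr/bin/env python3
    # Hypothetical sketch: list ongoing backups grouped by type (dump vs snapshot).
    import pymysql  # assumption: the metadata lives in a MariaDB database reachable with PyMySQL

    # 'dbbackups' and the column names are guesses inferred from the row pasted above
    conn = pymysql.connect(read_default_file='/root/.my.cnf', database='dbbackups')
    with conn.cursor() as cursor:
        cursor.execute(
            """SELECT type, section, name, start_date
                 FROM backups
                WHERE status = 'ongoing'
                ORDER BY type, section"""
        )
        for backup_type, section, name, start_date in cursor.fetchall():
            print(f'{backup_type:10} {section:10} {name} (started {start_date})')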
[15:31:59] DBA, Wikimedia-Site-requests, Patch-For-Review, User-MarcoAurelio: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107 (MarcoAurelio) >>! In T215107#4963000, @Wilfredor wrote: > How much it could take? Hello. I am not sure. There's a prob...
[16:23:32] question- I ask to "prepare a backup" (xtrabackup --prepare, compress, gather statistics) but there happen to be 2 ongoing backups with the same section and type, what would you expect to see?
[16:23:43] a) an error
[16:23:58] b) assume we want to prepare the latest, or the oldest?
[16:25:42] normally I would opt for a, but note that b is not an unhandled error, because the non-chosen one would be kept as ongoing, and we will check that the backup finished correctly anyway
[16:37:53] what would happen if b kicks in? would it overwrite the existing backup that is being processed?
[16:38:46] this may need some context
[16:39:02] the idea is to have a new option (I am working on it right now)
[16:39:29] dump_section.py s3 --only-postprocess
[16:40:13] which moves things from ongoing to latest, gathers stats, and compresses/prepares it if it is a snapshot
[16:40:53] however, what if there are 2 dump.s3. directories?
[16:42:03] alternatively, I can do something like dump_section.py dump.s3. --only-postprocess
[16:42:43] but I am too deep in devel so I cannot decide what makes more sense
[16:43:04] or I can just give an error "there are several dumps, aborting"
[16:48:10] so yes, it can "break" the backup if it is ongoing (in reality no, because it won't pass, as it will lack metadata/snapshot files)
[16:48:22] but that is true if it is run on a single file
[16:48:31] I think I will generate an error
[16:49:00] after all, you can also try to prepare a non-existing backup
[16:50:16] let me read back
[16:51:09] yeah, I prefer to generate an error rather than messing up (or potentially messing up) an ongoing backup
[17:09:17] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Anomie) Is there any way to find which queries are using them? The `(16)` prefix on those strikes me as particu...
[17:14:32] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (jcrespo) >>! In T51199#4969053, @Anomie wrote: > Is there any way to find which queries are using them? It sho...
[17:23:08] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar)
[17:24:08] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Broken storage?: ` Feb 18 13:24:54 mysqld[837]: InnoDB: Error number 5 means 'Input/output error'. `
[17:24:34] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) For #DBA , I will poke you on Thursday for some assistance. Not sure whether the Innodb database can be recovered. dep...
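A minimal Python sketch of option (a) as settled above: when asked to --only-postprocess and more than one ongoing backup matches the requested type and section, abort with an error instead of guessing. The directory layout, naming pattern and function name are assumptions for illustration, not the actual dump_section.py code.

    #!/usr/bin/env python3
    # Hypothetical sketch of the "there are several dumps, aborting" behaviour discussed above.
    import glob
    import os
    import sys

    ONGOING_DIR = '/srv/backups/ongoing'  # assumed location of ongoing backups


    def find_backup_to_postprocess(backup_type, section):
        """Return the single ongoing backup dir for (type, section), or abort."""
        pattern = os.path.join(ONGOING_DIR, f'{backup_type}.{section}.*')
        matches = sorted(glob.glob(pattern))
        if not matches:
            sys.exit(f'ERROR: no ongoing {backup_type} found for section {section}')
        if len(matches) > 1:
            # option (a): several candidates with the same type and section, refuse to pick one
            sys.exit(f'ERROR: there are several {backup_type}s for {section}, aborting: {matches}')
        return matches[0]


    if __name__ == '__main__':
        path = find_backup_to_postprocess('dump', 's3')
        print(f'would post-process (move to latest, gather stats, compress/prepare): {path}')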
[17:24:47] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui) It is certainly being used for some queries, I can see this counter increasing: ` root@db1089.eqiad...
[17:27:21] the beta cluster master database is dead (InnoDB corruption somehow), that is https://phabricator.wikimedia.org/T216635
[17:27:29] but I think it is too late for us Europeans to look into it
[17:27:29] hashar: check my comment
[17:27:57] maybe the disk / innodb tables are corrupted yeah ;((
[17:28:17] I would suggest looking at its storage, because that error isn't looking great
[17:28:20] Krenair was looking at the slave, maybe we can reuse that to repopulate a master
[17:28:25] yeah :(
[17:28:33] The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[17:29:03] I felt maybe one is out of date and maybe that can be recovered, albeit with some data loss for the last operations
[17:29:03] I would focus on the I/O error though
[17:34:56] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) ` umount /srv fsck /srv mount /srv ` I restarted mysql. Same I/O error. Maybe the disk is corr...
[17:37:00] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Anything on `dmesg`? Can you do a `touch /srv/test`?
[17:51:07] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Ah dmesg! [Mon Feb 18 13:24:48 2019] EXT4-fs (dm-0): warning: mounting fs with errors, running e2fsck is recommended [...
[17:51:18] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Anomie) I don't doubt that it is being used. But as we've seen elsewhere the planner can sometimes make strange...
[17:54:19] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) I did a fsck again, but I am afraid the partitions are corrupted beyond control :/ head /usr/share/prometheus-node-e...
[20:18:41] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) After I have done the fsck the I/O error is gone. It was: Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB:...
[20:19:45] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Addshore) p: Triage→High Marking as High (as I did with the other ticket) as beta is broken until this is fixed afaik.
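Related to the "counter increasing" check mentioned for T51199 at 17:24:47: one way to watch whether an index such as log_type_action is actually being read is to sample index counters over time. The Python sketch below uses MariaDB's userstat counters (INFORMATION_SCHEMA.INDEX_STATISTICS); whether that is the exact counter that was checked on db1089, and whether userstat is enabled there, are assumptions.

    #!/usr/bin/env python3
    # Hypothetical sketch: sample ROWS_READ for an index twice and compare.
    # Requires MariaDB with userstat enabled; the counter choice is an assumption.
    import time

    import pymysql  # assumption: PyMySQL is available


    def rows_read(cursor, schema, table, index):
        cursor.execute(
            """SELECT ROWS_READ
                 FROM INFORMATION_SCHEMA.INDEX_STATISTICS
                WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s AND INDEX_NAME = %s""",
            (schema, table, index))
        row = cursor.fetchone()
        return int(row[0]) if row else 0


    conn = pymysql.connect(read_default_file='/root/.my.cnf')
    with conn.cursor() as cursor:
        before = rows_read(cursor, 'enwiki', 'logging', 'log_type_action')
        time.sleep(60)  # wait a bit and sample again
        after = rows_read(cursor, 'enwiki', 'logging', 'log_type_action')
        print(f'log_type_action ROWS_READ went from {before} to {after}')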
[20:28:01] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Then I am trying: ` name=/etc/my.cnf [mysqld] innodb-force-recovery = 1 ` Then `sudo systemctl start mariadb` which sp...
[20:45:28] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Data looks very corrupted. At this point the best option is to rebuild that host from the slave.
[20:45:49] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Without innodb-force-recovery = 1, I get the same P8111 by simply moving enwiki/archive files. So that table definitely...
[20:50:09] marostegui: yeah I got a fault when reading some page :(((
[20:54:28] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Open→Stalled Not much to do here. The slave deployment-db04 eventually managed to start mysql and new instances...
[20:54:34] so yeah will wait for the slave to recover eventually ;)
[20:55:47] * hashar waves
[22:27:26] DBA, MediaWiki-API: API problem with usercontribs - https://phabricator.wikimedia.org/T216656 (Anomie) **TL;DR:** The thing to do here is probably to stop using the 'contributions' replica group in ApiQueryUserContribs, at least when we're not using `rev_user` to specify the user. The query here is ` la...
[22:29:06] DBA, MediaWiki-API: API problem with usercontribs - https://phabricator.wikimedia.org/T216656 (Anomie) Speaking of the partitioning on the 'contributions' group, I wonder whether we'll start running into similar issues in the other things using that group once we flip the actor migration switch so querie...