[05:29:35] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3583074 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This is all good now, thanks a lot Chris!  ``` root@db1059:~# megacli -LDPDInfo -aAll  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Ta...
[06:38:09] <wikibugs>	 10DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583165 (10Marostegui) p:05Triage>03Normal
[07:21:17] <Amir1>	 marostegui: hey, is it normal? https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-1h&to=now
[07:21:24] <marostegui>	 checking
[07:21:39] <marostegui>	 i was just checkking that actually
[07:21:41] <Amir1>	 db1095
[07:21:46] <marostegui>	 yep
[07:21:48] <marostegui>	 I was on that host right now
[07:22:00] <marostegui>	 it is not "production" it is a sanitarium host, which is used to filter data to labs
[07:24:19] <Amir1>	 thanks. I was thinking if I need to disable the cronjob
[07:25:12] <marostegui>	 I think it is coming from a massive truncate I am doing on s3 (which db1095 also replicates), and might be too heavy for it to cope with it without more throttling
[07:25:15] <marostegui>	 I stopped it and it is now decreasing lag
[07:25:17] <marostegui>	 so looks related
[07:28:57] <marostegui>	 it is back to 0, I am going to add more throttling
[08:16:40] <wikibugs>	 10DBA, 10Operations, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3583335 (10jcrespo) Not yet, this is still in use.
[08:28:34] <wikibugs>	 10DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3583345 (10Marostegui) I will migrate db1100 to file per table once the copy is done. db1049 was not using it.
[08:36:25] <jynus>	 wait, if T150306 is to be done
[08:36:25] <stashbot>	 T150306: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306
[08:36:35] <jynus>	 how come we are still writing to it?
[08:36:52] <marostegui>	 ?
[08:37:02] <jynus>	 or is the truncate what was replicated?
[08:37:03] <marostegui>	 what do you mean we are still writing?
[08:37:06] <marostegui>	 the truncate
[08:37:08] <jynus>	 ah
[08:37:10] <jynus>	 ok
[08:37:25] <jynus>	 I thought it was rows being written to mediawiki
[08:37:33] <jynus>	 *from
[08:37:35] <marostegui>	 ah no no :)
[08:38:16] <jynus>	 is that table filtered on labs? does it need special attention there?
[08:39:29] <jynus>	 and a second question, did it only fail on bawiktionary, but worked on other wikis on dbstore1002 with no problem?
[08:39:38] <marostegui>	 no special attention on labs
[08:39:59] <marostegui>	 regarding bawiktionary, only that one, but it would have failed on 10 more, which I have created and will later drop, to avoid breaking replication
[08:40:44] <jynus>	 interesting
[09:03:02] <jynus>	 I am going to merge gerrit:362217
[09:03:16] <jynus>	 but I will disable puppet on all important hosts
[09:03:46] <marostegui>	 i see
[09:03:50] <marostegui>	 go for it
[09:08:03] <jynus>	 it should be safe, but I am not going to risk a network outage
[09:08:16] <marostegui>	 hehe yeah
[09:10:08] <jynus>	 DNS query for '10.64.0.15' failed: NXDOMAIN
[09:10:23] <marostegui>	 from where
[09:10:29] <marostegui>	 because neodymium resolves it
[09:10:32] <jynus>	 db1066
[09:10:48] <jynus>	 apparently db1066 doesn't
[09:11:05] <jynus>	 is it thinking that is a dns?
[09:11:08] <marostegui>	 root@db1066:~# host 10.64.0.15
[09:11:08] <marostegui>	 15.0.64.10.in-addr.arpa domain name pointer db1011.eqiad.wmnet.
[09:11:08] <marostegui>	 root@db1066:~# host db1011.eqiad.wmnet.
[09:11:08] <marostegui>	 db1011.eqiad.wmnet has address 10.64.0.15
[09:11:09] <marostegui>	 ?
[09:11:14] <jynus>	 and it is trying to direct-resolve an ip?
[09:12:33] <jynus>	 s/@resolve(($MYSQL_ROOT_CLIENTS))/$MYSQL_ROOT_CLIENTS/ I guess?
[09:13:35] <marostegui>	 let's try yes
[09:13:38] <marostegui>	 it is strange
[09:15:53] <wikibugs>	 10DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583460 (10Marostegui) Dropped from all the shards except from s3, which is being done now slowly
[09:24:11] <jynus>	 it is working and I do not see any network drop on db1066, so I am allowing puppet everywhere
[09:24:23] <marostegui>	 \o/
[09:24:55] <wikibugs>	 10DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583486 (10Marostegui) a:03Marostegui
[09:33:26] <elukey>	 hello people, for https://phabricator.wikimedia.org/T174815 I'd like to increase the batch size of eventlogging_sync.sh from 1000 to 5k/10k, anything against it?
[09:34:18] <marostegui>	 how would that affect dbstore1002 disk issues?
[09:35:18] <jynus>	 elukey: if you hit max package or statement size, you could run into problems
[09:36:14] <elukey>	 marostegui: it wouldn't, it will only increase the maximum amount of inserts done in once IIUC
[09:36:44] <elukey>	 jynus: super ignorant about this part, can you explain that to me a bit more? 
[09:37:18] <jynus>	 there is a max limit on the amount of query text and results that you can do per query
[09:37:35] <marostegui>	 elukey: good, same rate but different batch size. right!
[09:37:37] <jynus>	 be aware that if you surpass it, query will fail
[09:37:59] <elukey>	 jynus: ahh okok
[09:38:23] <jynus>	 https://dev.mysql.com/doc/refman/5.7/en/packet-too-large.html
[09:38:59] <elukey>	 all right, will carefully watch logs
[09:41:43] <jynus>	 I am not saying it is going to happen
[09:41:55] <jynus>	 but a) it is the typical issue when making things larger
[09:42:08] <jynus>	 b) I think there was some issues in the past on analytics hosts
[09:42:23] <jynus>	 but I do not remember specifics
[09:42:37] <elukey>	 yep got it, it is a good heads up, I was looking for this kind of advice
[09:43:01] <jynus>	 there is a "backfill" script from db1046 to the replicas
[09:43:23] <jynus>	 running that and seeing no row has to be backfilled would be a good indication no row is being lost
[09:44:46] * elukey learns about a new script
[09:45:09] <elukey>	 those hosts are a neverending source of surprises
[09:45:34] <elukey>	 I can imagine all the "fun" discovering them when nothing was puppetized :D
[09:46:05] <jynus>	 well, I actually did not change much there because it was a minefiled
[09:46:17] <jynus>	 I just puppetized what was already running there
[09:46:30] <jynus>	 even knowing it was far from perfect
[09:47:19] <jynus>	 I wonder if the script would be necessary anymore if we upgrade to 10.1 and enable (with help of small app fixes) parallel replication
[09:47:51] <jynus>	 although I think tokudb has issues with parralel replication
[09:49:44] <jynus>	 I am about to enable puppet on all hosts again
[10:00:17] <moritzm>	 ok to restart apache on dbmonitor for a security update?
[10:00:45] <jynus>	 yes
[10:01:02] <moritzm>	 k, going ahead
[10:01:38] <jynus>	 I think it is going to be ok everytime, unless we are in the middle of a master failover or similar maintenance
[10:01:54] <jynus>	 in which monitoring is very important
[10:02:29] <jynus>	 or we are upgrading tendril or something like that
[10:04:07] <jynus>	 s5 replication lag didn't recover on time on dbstore1001 for the following backup cycle
[13:23:16] <wikibugs>	 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3584265 (10jcrespo) db1053 done, db1064 next.
[13:24:44] <marostegui>	 \o/
[14:30:47] <wikibugs>	 10DBA, 10Discovery, 10GeoData, 10Maps-Sprint: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#2566073 (10TheDJ) The coordinates parser function is within the scope of the GeoData extension.
[15:01:04] <wikibugs>	 10Blocked-on-schema-change, 10Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584584 (10Huji)
[15:36:46] <wikibugs>	 10DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3584710 (10Marostegui) Truncation still happening on s3. I have throttled it quite a bit, because db1095 (sanitarium) was struggling to be able to replicate without any delay.
[15:45:52] <wikibugs>	 10Blocked-on-schema-change, 10Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584564 (10Marostegui) Hi,   Is this meant to be a database schema change? https://wikitech.wikimedia.org/wiki/Schema_changes  If it is: can...
[15:51:17] <wikibugs>	 10Blocked-on-schema-change, 10Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584785 (10Huji) For IP subnet information, we could parse the data directly from RIRs. The data is available for free (in both senses of the...
[18:35:05] <Reedy>	 marostegui: jynus It'd be nice to go over some extension tables to see what updates they need... PK, unsigned etc etc
[18:35:21] <Reedy>	 I might file a meta task for looking over them
[18:37:01] <marostegui>	 Don't know how many tables without a PK we still have - not many I would say
[18:37:33] <Reedy>	 Or am I being naieve?
[18:37:45] <Reedy>	 When you're doing those checks.. You're actually checking the extension tables too?
[18:38:15] <Reedy>	 I haven't seen m?any tasks for extensions.. And I can't believe they're perfect basd on how core is :P
[18:38:19] <marostegui>	 Reedy: Don't really remember, as we started doing it like 7 months ago XD
[18:38:32] <marostegui>	 No no, probably they need lots of fixing I guess
[18:38:40] <jynus>	 Reedy: I am being conservative- let's fix core first
[18:38:45] <jynus>	 the rest can wait
[18:38:46] <Reedy>	 heh
[18:38:54] <marostegui>	 But don't know if we'd have time for it now
[18:39:02] <jynus>	 specially if we hit not-very-well maintened ones
[18:39:20] <jynus>	 I am not saying it shouldn't be done
[18:39:31] <jynus>	 I am just being realistic about my expectations
[18:39:35] <Reedy>	 of course, that's be weirdly hypocritical
[18:39:36] <Reedy>	 haha
[18:40:19] <jynus>	 the idea is once I added the "policy"
[18:40:39] <jynus>	 that anything either "officially sanctioned" or deployed to "wmf production"
[18:40:49] <jynus>	 should always have a PK
[18:40:58] <jynus>	 people have been slowly doing it
[18:41:46] <jynus>	 in some cases, when people asked for a schema change, I "convinced" them into adding them at the same time
[18:41:59] * marostegui loves that
[18:42:10] <Reedy>	 haha
[18:42:18] <Reedy>	 That's the perfect time to do it
[18:42:20] <jynus>	 so for the largests/more important (think wikibase, centralauth, flow, translation)
[18:42:28] <Reedy>	 Bit of CR time now saves potentially a lot of time later for yourselves
[18:42:28] <jynus>	 they are fixed or very close
[18:42:32] <jynus>	 yep
[18:42:42] <jynus>	 that is why I went "political" first
[18:42:44] <Reedy>	 heh
[18:42:46] <Reedy>	 You've also got the
[18:42:54] <Reedy>	 "If you don't add it, this isn't getting deployed"
[18:42:58] <Reedy>	 It's a good way to get people to do stuff
[18:43:00] <marostegui>	 XDDDDD
[18:43:03] <jynus>	 so I have a useless, but people seemingly respect "policy"
[18:43:15] <Reedy>	 Especially, when it's not an unrreasonable request
[18:43:20] <jynus>	 usless in terms that I said "the policy says X"
[18:43:35] <jynus>	 and it has a magic effect
[18:43:39] <marostegui>	 It is soooo painful to add a PK to a table that is already being used without one...pff
[18:43:57] <jynus>	 even if it was just added by me on the wiki and "discussed"
[18:44:22] <jynus>	 and nobody was going to object to "not add PKs"
[18:45:15] <jynus>	 the problem is not only that, maintenance without PK is almost impossible
[18:45:26] <jynus>	 and in most cases it is a design error, too
[18:45:37] <jynus>	 for example, the query cache one
[18:46:05] <jynus>	 I am not saying that a PK would fix that, but if it had been well designed, it wouldn't be the mess that it was now
[18:46:33] <marostegui>	 And even if it is not the best PK, by having one, you can easily change it :)
[18:46:33] <Reedy>	 heh
[18:46:35] <jynus>	 Reedy: you will be happy to now that I have compared almost every row of every image table of every myserver
[18:46:38] <jynus>	 *know
[18:46:48] <jynus>	 and finnaly, for the first time in 10 years
[18:46:56] <jynus>	 we have the same data on all hosts!
[18:47:09] <marostegui>	 jynus: do you know all the images in commonswiki by memory now?
[18:47:34] <jynus>	 not the images, but the chunks and the "events"
[18:47:42] <jynus>	 the 2005 bug
[18:47:47] <Reedy>	 same data on all hosts is overrated
[18:47:47] <jynus>	 the 2014 bug
[18:47:52] <jynus>	 and the 2015 bug
[18:47:59] <marostegui>	 what was that?
[18:48:02] <jynus>	 on moving/deleting images
[18:48:14] <jynus>	 discrepancies cluster around certain dates
[18:48:17] <marostegui>	 Reedy: sure, no more random page feature needed!
[18:48:32] <jynus>	 and in some cases, I can pinpoiint to specific code deploys/bugs
[18:48:40] <Reedy>	 marostegui: our not so random random pages...
[18:48:57] <jynus>	 the funny thing is that in most cases, there was not data loss
[18:49:07] <jynus>	 but there were not-deleted redirects
[18:49:17] <jynus>	 and same rows with different ids
[19:06:22] <wikibugs>	 10DBA, 10Community-Tech, 10cloud-services-team, 10Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3585986 (10Umherirrender)
[19:30:25] <Reedy>	 brad just merged my patch for making various columns unsigned
[19:30:26] <Reedy>	 https://gerrit.wikimedia.org/r/350437
[19:35:40] <jynus>	 cool
[19:39:04] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#3586122 (10Reedy) https://gerrit.wikimedia.org/r/#/c/350437/ is merged, so good to go whenever you are :)
[19:39:44] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#3586140 (10Reedy)
[19:46:07] <wikibugs>	 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3586159 (10jcrespo) I am quite confident of image now (I didn't check every host, but codfw for example seems to be in a much better state), checking filearchive next.
[20:47:56] <wikibugs>	 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586457 (10Nuria) Looks like it is all good:   select wiki_db, count(*) from wmf.mediawiki_history where snapshot = "2017-08" and wiki_d...
[20:48:12] <wikibugs>	 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586458 (10Nuria) 05Open>03Resolved
[21:26:36] <wikibugs>	 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586571 (10Neil_P._Quinn_WMF) @Nuria, yes, the queries where I originally discovered this work now. Thank you!
[23:18:31] <wikibugs>	 10DBA, 10Data-Services, 10Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3586971 (10Dispenser) From arwiki: [[https://ar.wikipedia.org/wiki/File:%D8%A7%D9%84%D9%84%D9%87_%D8%B9%D8%B2_%D9%88%D8%AC%D9%84.png|File:الله عز وجل.png]]. Deleted in Nov 2015. ```lang=bash $ sql arwiki_p...