[05:26:28] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547688 (Marostegui) >>! In T173859#3545916, @Steinsplitter wrote: >>>! In T173859#3544614, @Marostegui wrote: >> @MarcoAurelio See: T172207#3544611 >>...
[05:26:46] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547689 (Marostegui) stalled → Open
[05:31:42] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547704 (Marostegui)
[05:37:39] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547705 (Marostegui) >>! In T173891#3546800, @kaldari wrote: > Ran `foreachwiki sql.php /srv/mediawiki/php/maintenance/archives/patch-ip_change...
[06:01:25] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547706 (Marostegui) >>! In T173891#3545400, @kaldari wrote: >>if possible try to avoid assigning it directly to Jaime > I assigned it to Jaime...
[06:02:19] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3547707 (Marostegui) >>! In T173365#3545014, @Cmjohnson wrote: > @Marostegui The ssd has been replaced. Please resolve after rebuild Should we close this ticket and create a new one f...
[06:07:36] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3547710 (Marostegui) a: Cmjohnson This host is now ready to be decommissioned and ready for @Cmjohnson to do the DC-Ops part
[08:04:05] jynus: hey, this is ready for merge: https://gerrit.wikimedia.org/r/#/c/370626/7
[08:04:17] I was checking it out
[08:04:24] I have been monitoring everything and it's fine
[08:04:38] https://puppet-compiler.wmflabs.org/compiler02/7589/terbium.eqiad.wmnet/
[08:04:53] I am not convinced about using /tmp
[08:05:19] what do you suggest?
[08:05:44] a dir only www-data can write to
[08:06:36] /home/www-data ?
[08:08:32] also, the cron parameters are wrong, unless you want that to execute only on Sundays at 3:30
[08:08:49] okay. Let me fix that
[08:10:33] jynus: Are you sure? I checked the top cronjobs and they are being run every hour
[08:10:39] *above cronjobs
[08:12:52] I sent you https://puppet-compiler.wmflabs.org/compiler02/7589/terbium.eqiad.wmnet/
[08:13:39] others have '"minute": "*/3"'
[08:13:58] or minute: [0, 15, 30, 45]
[08:14:01] yours has
[08:14:18] "minute": "30", "hour": "3", "weekday": "0"
[08:14:39] check it for yourself on that link
[08:19:30] let's move the log to /var/log/wikidata
[08:19:35] let me try to learn that
[08:19:41] jynus: Already did that
[08:19:56] I'm trying to fix the timing and then upload the new patchset
[08:20:21] I will move the (assumed) canonical one on /tmp to there when you tell me
[08:21:02] did you check the issue with logging frequency?
[08:21:32] yeah, it will write once every three or four minutes
[08:21:35] I guess that's okay
[08:21:40] cool yes
[08:21:55] we can leave it running for more than an hour if that will be a problem
[08:22:09] will it skip rows if they are already done?
[08:22:49] so make the changes, and you can rebuild the puppet compiler job yourself
[08:23:22] at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/7589/
[08:23:28] yeah it will skip them but report it (so it will write faster for the first 500K rows)
[08:23:30] (logged)
[08:23:52] oh, we can start on the 500Kth
[08:23:56] that is no problem
[08:24:22] the problem with tmp is that it can easily be overwritten
[08:24:45] so better with the other logs
[08:25:00] if there is any problem, we can remove or modify it
[08:25:26] but better than some other random process overwriting it
[08:27:25] jynus: agreed, but regarding the timing, I couldn't find the problem or the way to fix it, https://docs.puppet.com/puppet/latest/types/cron.html says omitted parameters default to "*"
[08:27:49] and couldn't find the cron in the puppet compiler
[08:28:12] maybe the behaviour is different in our version
[08:28:41] I would, out of caution, be explicit and put all of those as '*'
[08:28:58] okay
[08:29:08] maybe it is a compiler limitation, but better be sure - I remember having reverted several cron deployments
[08:29:24] because I am 100% sure the production behaviour is the same as the compilation
[08:29:48] so just hour => '*', weekday => '*' should be enough
[08:30:45] jynus: okay, uploaded the new patchset, please check :)
[08:31:07] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/7590/console
[08:31:39] and if you ask, yes, that is manual CI^
[08:31:47] nice
[08:32:30] jynus: once it's merged, is it possible to write the last lines of /tmp/rebuildTermSqlIndex.log into the file I specified in the puppet patch?
[08:32:42] sure
[08:32:49] It will make the script skip the first 500K items
[08:32:50] thanks
[08:34:16] mmmm, what will happen on rotation?
[08:35:34] nothing I guess, but the files are super super small
[08:35:37] it won't log much
[08:35:54] The file is not rotating
[08:36:57] maxage 180
[08:37:17] hopefully this will not take more than 180 days :-)
[08:38:38] btw. the jobqueue is still not happy: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-30d&to=now
[08:43:34] Amir1: so "cp /tmp/rebuildTermSqlIndex.log /var/log/wikidata/rebuildTermSqlIndex.log" ?
[08:43:43] yeah
[08:43:54] last one is 488002
[08:43:59] Q518873
[08:44:03] that's correct
[08:44:32] tail -n 1 /var/log/wikidata/rebuildTermSqlIndex.log -> Processed up to page 488516 (Q519411)
[08:44:54] nice
[08:44:56] Thanks!
[08:45:11] start monitoring that long when I tell you
[08:45:14] *log
[08:45:20] sure
[08:45:26] running puppet agent?
[08:45:35] I have to deploy it first
[08:45:44] kk
[08:47:26] Amir1: note it will not only be better for config changes, you will not need to track the state most of the time
[08:48:10] sure
[08:49:14] I can confirm it is on www-data's crontab # Puppet Name: wikibase-rebuildTermSqlIndex
[08:50:20] It hasn't started yet; I think it will start in 40 minutes
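For reference, a minimal Puppet sketch of the cron fix discussed above, using the core cron and file resource types. The resource name (wikibase-rebuildTermSqlIndex), the www-data user, the schedule fields and the log path come from the conversation; the exact command line and the log-directory resource are assumptions, not the actual content of Gerrit change 370626:

    # /var/log/wikidata instead of /tmp: only www-data can write here,
    # so no random process can overwrite the state log (ownership assumed)
    file { '/var/log/wikidata':
        ensure => directory,
        owner  => 'www-data',
    }

    cron { 'wikibase-rebuildTermSqlIndex':
        ensure  => present,
        user    => 'www-data',
        # the first patchset compiled to minute => '30', hour => '3',
        # weekday => '0', i.e. Sundays at 03:30 only; omitted fields do
        # default to '*', but being explicit makes the hourly intent clear
        minute  => '30',
        hour    => '*',
        weekday => '*',
        # hypothetical invocation; output is appended to the agreed log file
        command => '/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki >> /var/log/wikidata/rebuildTermSqlIndex.log 2>&1',
    }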
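Likewise, a sketch of the rotation rule implied by the "maxage 180" remark above: logrotate's maxage directive deletes rotated copies older than 180 days. Only the maxage value is from the chat; the file path and the remaining directives are assumptions:

    file { '/etc/logrotate.d/wikidata':
        ensure  => file,
        # weekly/missingok/notifempty are assumed defaults; maxage 180
        # removes rotated logs older than 180 days
        content => "/var/log/wikidata/*.log {\n    weekly\n    missingok\n    notifempty\n    maxage 180\n}\n",
    }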
[08:50:56] maybe I can do a test run?
[08:51:06] with a lower timeout?
[08:51:35] yeah sure
[08:55:05] running now
[08:55:34] works fine
[08:55:41] reported a new row
[08:55:59] cool
[08:56:03] I will let it finish
[08:56:10] to see if something strange happens on kill
[08:57:04] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3547874 (Marostegui)
[08:58:47] Amir1: what do you think of this: I will prepare a temporary stop of the script, so in case something goes wrong, you can tell any ops easily
[08:59:15] yeah sure
[08:59:23] hopefully not needed
[09:13:47] DBA, Wikidata, Patch-For-Review, User-Ladsgroup, Wikidata-Sprint: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460#3465512 (jcrespo) So this is deployed into production, we did a test run and it seems to work as intended. I left a "disable" patch h...
[10:02:29] jynus: are you joining the meeting?
[11:04:49] jynus, marostegui: a quick curiosity about the 30m restart time on new hardware, did mariadb improve what I reported in https://jira.mariadb.org/browse/MDEV-9930 ?
[11:06:03] I can show you on graphs
[11:07:38] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1078&var-port=9104&from=1502901364771&to=1502922681478
[11:07:41] vs
[11:08:38] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=1503378415766&to=1503572897594&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13315
[11:09:01] the new host can be pooled after 30 minutes, even if it takes 6 hours to be fully ok
[11:09:23] dbstores have problems catching up after restart
[11:09:46] ok, fair enough! thanks
[11:10:39] the only problem I see is that if we restart again before the ~6h it takes to re-populate the full buffer pool, we'll get a smaller buffer pool dump to reload
[11:11:13] not very important on backup hosts that are not serving traffic anyway :)
[11:30:16] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3548140 (Steinsplitter) Open → Resolved done: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Karl-Heinz_Jansen Thanks @Marostegui
[11:33:54] DBA, MediaWiki-Parser, MediaWiki-Platform-Team, Patch-For-Review: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3548145 (jcrespo) Open → Resolved a: jcrespo > To keep defragmenting on a regular basis? Yes that is a horrible thing to do, but seeing reg...
[11:42:00] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548155 (jcrespo)
[11:43:26] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548157 (Marostegui)
[11:59:44] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3505388 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1096.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[12:21:25] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548219 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1096.eqiad.wmnet'] ``` and were **ALL** successful.
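The 30-minute repooling discussed above relies on MariaDB dumping the buffer pool to disk on shutdown and reloading it in the background on startup, so a host can take traffic while the cache warms up over ~6 hours; restarting before the reload finishes leaves a smaller dump for the next start. A sketch of the relevant my.cnf fragment follows; managing it via a Puppet file resource at this path is an assumption (the real WMF config is built from its own templates), but both variables are standard InnoDB settings:

    file { '/etc/mysql/conf.d/buffer-pool-reload.cnf':
        ensure  => file,
        # dump the buffer pool on shutdown, reload it (in the background)
        # on startup, so the host can be pooled before the cache is warm
        content => "[mysqld]\ninnodb_buffer_pool_dump_at_shutdown = 1\ninnodb_buffer_pool_load_at_startup  = 1\n",
    }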
[12:55:06] DBA, MediaWiki-Watchlist, Wikimedia-General-or-Unknown, Wikimedia-log-errors: User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error - https://phabricator.wikimedia.org/T171898#3548310 (Aklapper) Merging into {T171027} as it seems to be the same underlying...
[12:55:35] DBA, MediaWiki-Watchlist, Wikimedia-General-or-Unknown, Wikimedia-log-errors: User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error - https://phabricator.wikimedia.org/T171898#3548312 (Aklapper)
[15:19:47] Blocked-on-schema-change, MediaWiki-Platform-Team, Structured-Data-Commons, Wikidata: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3549134 (daniel)
[16:16:43] Blocked-on-schema-change, MediaWiki-Platform-Team, Structured-Data-Commons, Wikidata: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3549380 (daniel)
[16:19:28] DBA, Operations, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3549410 (jcrespo) It is my intention to reimage db1069, provisioning it from db1033 (s7) and pool it as a db1028 replacement, making both db1033 and db1028 obsolete (to be retire...
[16:42:29] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549499 (jcrespo) Open → Resolved a: Cmjohnson > Should we close this ticket and create a new one for testing another host and see its behaviour? Let's just do it.
[16:47:02] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549516 (jcrespo)
[16:47:16] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549532 (Cmjohnson) let me know which db you want to test and when
[16:47:18] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549533 (jcrespo) p: Triage → Normal
[16:47:57] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549535 (Marostegui) This is what I commented on the other ticket: ``` I would like to propose db1076 (s2) as a candidate host to do the test once db1078 is back in the pool with the...
[16:48:56] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549544 (jcrespo) Let's take some time to find a good candidate and create some fake load, and we will ping either you or Papaul on T174054.
[16:52:47] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3549555 (Marostegui)
[19:13:43] DBA, Operations: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550282 (jcrespo)
[19:26:08] DBA, Operations, Patch-For-Review: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550329 (jcrespo)
[19:27:56] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3550377 (Cmjohnson) Return shipping info for disk UPS 1ZW0948Y9082750467