
Task #10961

Handling hot file in w-cmsstor319-disk-disk3

Added by Chih-Hao Huang about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Start date: 11/19/2015
Due date: 11/26/2015
% Done: 100%
Estimated time: 2.00 h
Spent time:
Duration: 8

Description

This is a placeholder for handling the hot-file issue in w-cmsstor319-disk-disk3.

History

#1 Updated by Chih-Hao Huang about 4 years ago

  • % Done changed from 0 to 80

[1] An alarm showed up in Zabbix: the ICMP response time on cmsstor319 was too long.
[2] Checked the network; there were overruns on eth1:

[root@cmsstor319 dcache]# ifconfig
bond0 Link encap:Ethernet HWaddr 00:25:90:1B:E8:AD
inet addr:131.225.191.233 Bcast:131.225.191.255 Mask:255.255.252.0
inet6 addr: fe80::225:90ff:fe1b:e8ad/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:5283978207 errors:0 dropped:0 overruns:2257 frame:0
TX packets:11268419058 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:50000
RX bytes:1414486675399 (1.2 TiB) TX bytes:16712275986625 (15.1 TiB)

eth0 Link encap:Ethernet HWaddr 00:25:90:1B:E8:AD
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:426625069 errors:0 dropped:0 overruns:0 frame:0
TX packets:5682297850 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:50000
RX bytes:140840241148 (131.1 GiB) TX bytes:8416538286271 (7.6 TiB)
Memory:faf60000-faf7ffff

eth1 Link encap:Ethernet HWaddr 00:25:90:1B:E8:AC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:4857353934 errors:0 dropped:0 overruns:2257 frame:0
TX packets:5586128036 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:50000
RX bytes:1273646517961 (1.1 TiB) TX bytes:8295747937397 (7.5 TiB)
Memory:fafe0000-faffffff

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:2156241 errors:0 dropped:0 overruns:0 frame:0
TX packets:2156241 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:441885491 (421.4 MiB) TX bytes:441885491 (421.4 MiB)
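The overrun counters shown by ifconfig above come from the kernel's per-interface statistics in /proc/net/dev, where they appear in the "fifo" column. As a minimal sketch (not part of the original troubleshooting, and assuming a Linux host with the standard /proc/net/dev layout), they can be pulled out programmatically:

```python
# Sketch: extract RX overrun counts per interface from /proc/net/dev.
# The RX "fifo" column (5th RX field) is what ifconfig reports as "overruns".

def rx_overruns(stats_text):
    """Return {interface: RX overrun count} given the text of /proc/net/dev."""
    result = {}
    for line in stats_text.splitlines()[2:]:      # skip the two header lines
        name, sep, fields = line.partition(":")
        if not sep:
            continue
        cols = fields.split()
        if len(cols) >= 5:
            result[name.strip()] = int(cols[4])   # RX fifo == overruns
    return result

# Usage on a live host:
#   with open("/proc/net/dev") as f:
#       print(rx_overruns(f.read()))
```

Polling this periodically (rather than eyeballing ifconfig) makes it easy to tell whether the overrun counter is still climbing or is historical.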

[3] service network restart did not solve the problem; ping was still in the 100 ms range.
[4] Ganglia showed outgoing network traffic saturated since 2015-11-18 13:00.
[5] Checked the pool request queue (http://cmsdcacheadmindisk.fnal.gov:2288/queueInfo) and found many movers (> 180) queued for w-cmsstor319-disk-disk3.
[6] Checked cmsstor319:/var/log/w-cmsstor319-disk-disk3Domain.log and found many failures for a single file:

........
18 Nov 2015 23:17:54 (w-cmsstor319-disk-disk3) [XrootdLFNs-cmssrmdisk PoolDeliverFile 0000032A6E4D83324AEC80C8C5C8DB739CE8] Transfer failed: java.lang.InterruptedException
18 Nov 2015 23:17:54 (w-cmsstor319-disk-disk3) [XrootdLFNs-cmssrmdisk PoolDeliverFile 0000032A6E4D83324AEC80C8C5C8DB739CE8] Transfer failed: java.lang.InterruptedException
18 Nov 2015 23:17:54 (w-cmsstor319-disk-disk3) [XrootdLFNs-cmssrmdisk PoolDeliverFile 0000032A6E4D83324AEC80C8C5C8DB739CE8] Transfer failed: java.lang.InterruptedException
18 Nov 2015 23:17:54 (w-cmsstor319-disk-disk3) [XrootdLFNs-cmssrmdisk PoolDeliverFile 0000032A6E4D83324AEC80C8C5C8DB739CE8] Transfer failed: java.lang.InterruptedException
........
[7] Determined that 0000032A6E4D83324AEC80C8C5C8DB739CE8 is a hot file.
[8] Manually replicated it to five other pools:

[cmsdcacheadmindisk.fnal.gov] (PnfsManager) admin > cacheinfoof 0000032A6E4D83324AEC80C8C5C8DB739CE8
w-cmsstor412-disk-disk1 w-cmsstor413-disk-disk1 w-cmsstor411-disk-disk1 w-cmsstor410-disk-disk1 w-cmsstor414-disk-disk1 w-cmsstor319-disk-disk3
[9] Wait and see whether it cools down.
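Steps [6] and [7] above amount to counting repeated transfer failures per PNFS ID in the pool domain log. A hypothetical sketch of that scan (the regex and threshold are assumptions based on the log excerpt, not a documented dCache tool):

```python
# Sketch: find hot-file candidates by counting "Transfer failed" lines
# per PNFS ID in a dCache pool domain log (format as in the excerpt above).
import re
from collections import Counter

PNFSID = re.compile(r"PoolDeliverFile (\w+)\]")   # assumed line format

def hot_files(log_lines, threshold=100):
    """Return [(pnfsid, failure_count)] for IDs failing more than `threshold` times."""
    counts = Counter()
    for line in log_lines:
        if "Transfer failed" in line:
            m = PNFSID.search(line)
            if m:
                counts[m.group(1)] += 1
    return [(pnfsid, n) for pnfsid, n in counts.most_common() if n > threshold]

# Usage:
#   with open("/var/log/w-cmsstor319-disk-disk3Domain.log") as f:
#       print(hot_files(f, threshold=100))
```

Any ID that dominates the failure counts is a replication candidate, as was done here via the admin shell in step [8].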

#2 Updated by Chih-Hao Huang about 4 years ago

  • % Done changed from 80 to 100

It has all cleared.
However, eth1 is still showing more overruns.
Will investigate further.

#3 Updated by Chih-Hao Huang about 4 years ago

  • Status changed from Assigned to Resolved

Indeed, the network issue was a consequence of the interface being hit too hard by too many requests.
Once the hot file was replicated to other pools, the activity died down and no new overrun errors appeared.
However, it remains a mystery why only eth1 showed overrun errors and not eth0.
