一次离奇的JVM Crash

背景

最近线上某个服务出现了 JVM Crash 现象,并且每次都发生在比较相近的时间段里,下面是某一次的日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007ff3ed5c1811, pid=192, tid=140655176587008
#
# JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 1.8.0_45-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 9931 C2 java.util.ArrayList.contains(Ljava/lang/Object;)Z (14 bytes) @ 0x00007ff3ed5c1811 [0x00007ff3ed5c17c0+0x51]
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
--------------- T H R E A D ---------------
Current thread (0x00007ff36c20f000): JavaThread "ForkJoinPool-55-worker-0" daemon [_thread_in_Java, id=84902, stack(0x00007fecd5c61000,0x00007fecd5d62000)]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000009
Registers:
RAX=0x00000007c001b6c0, RBX=0x0000000000000002, RCX=0x00000000ebfedc3e, RDX=0x0000000000000001
RSP=0x00007fecd5d5fb10, RBP=0x0000000000000000, RSI=0x000000075ff6e108, RDI=0x0000000000000002
R8 =0x0000000000000002, R9 =0x00000000ebfedc32, R10=0x00000007c001b6c0, R11=0x0000000000000000
R12=0x0000000000000000, R13=0x0000000000000002, R14=0x000000075ff6e1d8, R15=0x00007ff36c20f000
RIP=0x00007ff3ed5c1811, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
TRAPNO=0x000000000000000e
Top of Stack: (sp=0x00007fecd5d5fb10)
0x00007fecd5d5fb10: 000000000000003f 0000000000000000
0x00007fecd5d5fb20: 00000005c5a0b488 00000007c0412700
0x00007fecd5d5fb30: 00000000ebfe7804 000000075fecbe70
0x00007fecd5d5fb40: 00000000b8ca514c 00007ff3f2b5e2ec
0x00007fecd5d5fb50: 00007fecd5d5fb90 00007fecd5d5fb80
0x00007fecd5d5fb60: 00007ff248698ec0 000000001e61f9cd
...
Instructions: (pc=0x00007ff3ed5c1811)
0x00007ff3ed5c17f1: c4 e1 f9 6e ce 8b 4e 14 44 8b 46 10 33 ed 48 85
0x00007ff3ed5c1801: d2 0f 84 95 04 00 00 45 85 c0 0f 8e 65 03 00 00
0x00007ff3ed5c1811: 44 8b 52 08 41 81 fa da 02 00 f8 0f 85 a8 05 00
0x00007ff3ed5c1821: 00 45 8b 6c cc 0c 49 8d 2c cc 45 85 ed 0f 86 48
Register to memory mapping:
RAX=0x00000007c001b6c0 is pointing into metadata
RBX=0x0000000000000002 is an unknown value
RCX=0x00000000ebfedc3e is an unknown value
RDX=0x0000000000000001 is an unknown value
RSP=0x00007fecd5d5fb10 is pointing into the stack for thread: 0x00007ff36c20f000
RBP=0x0000000000000000 is an unknown value
RSI=
[error occurred during error reporting (printing register info), id 0xb]
Stack: [0x00007fecd5c61000,0x00007fecd5d62000], sp=0x00007fecd5d5fb10, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 9931 C2 java.util.ArrayList.contains(Ljava/lang/Object;)Z (14 bytes) @ 0x00007ff3ed5c1811 [0x00007ff3ed5c17c0+0x51]
--------------- P R O C E S S ---------------
Java Threads: ( => current thread )
0x00007fee48009800 JavaThread "hystrix-fuiou_default-10" daemon [_thread_in_native, id=84913, stack(0x00007fecd6a6f000,0x00007fecd6b70000)]
=>0x00007ff36c20f000 JavaThread "ForkJoinPool-55-worker-0" daemon [_thread_in_Java, id=84902, stack(0x00007fecd5c61000,0x00007fecd5d62000)]
0x00007fee4400a000 JavaThread "ForkJoinPool-55-worker-3" daemon [_thread_blocked, id=84901, stack(0x00007fecd5d62000,0x00007fecd5e63000)]
...
Internal exceptions (10 events):
Event: 42112.006 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer> (0x000000077aad98c0) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp
Event: 42112.007 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentBeanInfo> (0x000000077ab42100) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp,
Event: 42112.008 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer> (0x000000077aba3a70) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp
Event: 42112.009 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentBeanInfo> (0x000000077a40c330) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp,
Event: 42112.009 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer> (0x000000077a46dca0) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp
Event: 42112.192 Thread 0x00007ff154083000 Exception <a 'java/net/SocketTimeoutException': Read timed out> (0x0000000779e96098) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/prims/jni.cpp, line 735]
Event: 42112.817 Thread 0x00007ff154265800 Exception <a 'java/net/SocketTimeoutException': Read timed out> (0x000000077743dc38) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/prims/jni.cpp, line 735]
Event: 42113.412 Thread 0x00007ff15427e000 Exception <a 'java/net/SocketTimeoutException': Read timed out> (0x00000007747e4f80) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/prims/jni.cpp, line 735]
Event: 42114.344 Thread 0x00007ff154351000 Exception <a 'java/net/SocketTimeoutException': Read timed out> (0x000000076d9d3658) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/prims/jni.cpp, line 735]
Event: 42115.538 Thread 0x00007ff36c20f000 Implicit null exception at 0x00007ff3ed5c1811 to 0x0000000000000000
Events (10 events):
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer done
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer done
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer
Event: 42112.009 loading class com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer done
Event: 42112.284 Thread 0x00007ff3fce22000 flushing nmethod 0x00007ff3f5a26250
Event: 42112.284 Thread 0x00007ff3fce22000 flushing nmethod 0x00007ff3f5a43390
Event: 42112.288 Thread 0x00007ff3fce22000 flushing nmethod 0x00007ff3f63395d0
Event: 42114.505 Thread 0x00007ff0cc017800 Thread added: 0x00007ff0cc017800

分析错误日志

从日志中能获取到的重要信息

1
2
3
4
5
6
7
8
9
10
11
12
13
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000009
SIGSEGV: A segmentation fault occurs when a program attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).
访问了一个无效的内存引用(可能是不允许访问的内存地址,或者是禁止的访问方式)
SEGV_MAPERR :地址没有映射到对象
siaddr: 0x0000000000000009
这个就是访问的地址,0x00000000-0x0000FFFF 是进程NULL区的保留地址,禁止访问
Event: Thread 0x00007ff36c20f000 Implicit null exception at 0x00007ff3ed5c1811 to 0x0000000000000000
提示了这是一个隐式空指针问题

从线程名中找到对应的代码逻辑块,并没有发现对 ArrayList.contains 的调用,随后通过使用 udcli 转译指令成汇编,貌似也没有找到匹配的代码

1
2
3
4
➜ ~ echo '44 8b 52 08 41 81 fa da 02 00 f8 0f 85 a8 05 00' |udcli -intel -x -64 -o 0x00007ff3ed5c1811
00007ff3ed5c1811 448b5208 mov r10d, [rdx+0x8] //转移 rdx+8 (也就是出问题的地址)
00007ff3ed5c1815 4181fada0200f8 cmp r10d, 0xf80002da //比较俩个值(相减),应该是个比大小的操作
00007ff3ed5c181c 0f85a80500 invalid

暂时没有什么头绪,想借助 core dump 来进行更深入的分析,但因为我们应用跑在 Docker 里,core dump 默认没有打开。

首先找 SRE 团队配合改一下 Docker 的配置,另外在应用内加上了更多的定位日志,希望能定位到具体行。

1
2
3
1. docker run 参数里的加上 --ulimit core=-1
2. 在 /etc/sysctl.conf 里可以配置文件生成路径
kernel.core_pattern = /persist/core_%e_%p

尝试定位修复

等待下一次 crash 期间也做了几件事

  • 猜测是否是使用了 forkjoin 框架本身的问题,随后替换成标准的线程池,无用

  • 根据日志定位到了大概的代码块,是一个普通是 foreach 中,仔细分析后似乎没有发现问题

  • 再次仔细的分析了下crash 日志,发现每次都会有相应的 jni 调用及某个类的异常,如下

    1
    2
    Event: 42112.009 Thread 0x00007fee6c014800 Exception <a 'java/lang/ClassNotFoundException': com/xueqiu/fundx/market/domain/FundDataStockDetailPercentCustomizer> (0x000000077a46dca0) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/classfile/systemDictionary.cpp
    Event: 42112.192 Thread 0x00007ff154083000 Exception <a 'java/net/SocketTimeoutException': Read timed out> (0x0000000779e96098) thrown at [/RE-WORK/workspace/8-2-build-linux-amd64/jdk8u45/3457/hotspot/src/share/vm/prims/jni.cpp, line 735]

    根据线程分析,这是一个MQ的触发事件,但如果是JNI导致的错误,为什么是另外一个线程的异常 Crash,这里没有想通

分析 core dump

拿到 core dump 后,首先想通过 jstack 获取到具体的异常栈信息,栈信息已丢失

1
Thread 84902: (state = IN_JAVA) //线程正在执行编译后的Java代码

尝试使用 jmap 导出内存 dump,失败

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
app@bjxg-vd-9-201:~$ jmap -dump:format=b,file=heap.hprof /usr/bin/java core_java_192
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true -Dfile.encoding=UTF-8
Attaching to core core_java_192 from executable /usr/bin/java, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Dumping heap to heap2.hprof ...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.utilities.AssertionFailure: can not get class data for java/util/stream/FindOps$$Lambda$210x00000007c019bc28
at sun.jvm.hotspot.utilities.Assert.that(Assert.java:32)
at sun.jvm.hotspot.utilities.HeapHprofBinWriter.writeInstance(HeapHprofBinWriter.java:803)
at sun.jvm.hotspot.utilities.AbstractHeapGraphWriter$1.doObj(AbstractHeapGraphWriter.java:95)
at sun.jvm.hotspot.oops.ObjectHeap.iterateLiveRegions(ObjectHeap.java:353)
at sun.jvm.hotspot.oops.ObjectHeap.iterate(ObjectHeap.java:171)
at sun.jvm.hotspot.utilities.AbstractHeapGraphWriter.write(AbstractHeapGraphWriter.java:51)
at sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(HeapHprofBinWriter.java:433)
at sun.jvm.hotspot.tools.HeapDumper.run(HeapDumper.java:62)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.HeapDumper.main(HeapDumper.java:83)
... 6 more

随后尝试使用 gdb 加载分析

1
2
3
4
5
6
7
8
9
10
11
(gdb) bt
#0 0x00007ff40130ecc9 in _quicksort (pbase=0x0, total_elems=<optimized out>, size=32, cmp=0x7fecd5d5d420, arg=0x7ff401269640 <OnError>) at qsort.c:128
#1 0x0000000000010206 in ?? ()
#2 0x0000000000000033 in ?? ()
#3 0x0000000000000004 in ?? ()
#4 0x000000000000000e in ?? ()
#5 0xfffffffe7ffbfaff in ?? ()
#6 0x0000000000000030 in ?? ()
#7 0x00007fecd5d5d880 in ?? ()
#8 0x00007ff4009ee156 in jni_GetObjectField () from /data/jdk1.8.0_45/jre/lib/amd64/server/libjvm.so
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Backtrace 最后一帧是一个qsort 函数,这里和之前汇编 cmp 操作能对应上

至此大概猜测是某种原因破坏了堆内存,例如 JNI 调用使用了错误的参数触发了空指针等

最后尝试一下更新 JDK 版本,1.8.0_45 -> 1.8.0_201,尝试之后至今还未再次发生Crash 事件,难道真的是一个JDK 的bug 吗?