连接池误用引发的服务失去响应

记录一次线上问题排查过程。

症状

服务完全无法处理用户请求,返回HTTP504状态码,无日志输出。但进程正常运行,端口监听正常。

内存dump排查

有人曾经说过:“世界上的Java程序员,有一半在写代码的时候制造内存问题,另一半在解决内存问题”。

所以不管3721先dump一下内存快照看下再说(命令jmap -dump:format=b,file=filename [pid])。

很遗憾,内存各个区都很富余,问题不在内存。

查看jvm监控

公司的JVM监控:基于Prometheus从spring cloud actuator接口中采集jvm监控数据,然后生成Dashboard图。

threads growup

发现一处可疑的线程数增长情况。

注:此处可以看到积极接入监控系统对jvm问题排查的关键作用,平时不监控,排查两行泪

线程dump排查

发现线程的可疑情况后,将服务线程dump出来(命令jstack -l [pid])。建议将dump文件下载下来,放进分析工具进行分析(这里推荐一个在线的分析工具),肉眼看原始文件真的很累。

threads stack trace

dump分析工具将线程调用栈进行了分组,可以看到有200个类似以下调用栈的线程waiting中。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
http-nio-8080-exec-796
priority:5 - threadId:0x00007fa1f8098800 - nativeId:0x1dfe - nativeId (decimal):7678 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000a3e59118> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:583)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:442)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
at redis.clients.util.Pool.getResource(Pool.java:49)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
at com.zhongan.redis.JedisPoolManager.getJedis(JedisPoolManager.java:83)
at com.zhongan.castle.msspy.common.CastleRedisClient.callback(CastleRedisClient.java:27)
at com.zhongan.castle.msspy.common.CastleRedisClient.get(CastleRedisClient.java:60)
at com.zhongan.castle.msspy.controller.BigScreenMonitorController.searchQuoteSummaryInfo(BigScreenMonitorController.java:640)
at com.zhongan.castle.msspy.controller.BigScreenMonitorController$$FastClassBySpringCGLIB$$1aae1c42.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:738)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:85)
at com.zhongan.sleuth.trace.aspect.ZaTraceAspect.wrapWithCorrelationId(ZaTraceAspect.java:74)
at sun.reflect.GeneratedMethodAccessor445.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:629)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:618)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:168)
at org.springframework.aop.aspectj.AspectJAfterThrowingAdvice.invoke(AspectJAfterThrowingAdvice.java:62)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:168)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673)
at com.zhongan.castle.msspy.controller.BigScreenMonitorController$$EnhancerBySpringCGLIB$$d22bffc0.searchQuoteSummaryInfo(<generated>)
at sun.reflect.GeneratedMethodAccessor1110.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205)
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:133)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:97)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:827)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:738)
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:85)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:967)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:901)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.boot.web.filter.ApplicationContextHeaderFilter.doFilterInternal(ApplicationContextHeaderFilter.java:55)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.boot.actuate.trace.WebRequestTraceFilter.doFilterInternal(WebRequestTraceFilter.java:110)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at com.zhongan.sso.client.AuthenticationFilter.doFilter(AuthenticationFilter.java:138)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.HttpPutFormContentFilter.doFilterInternal(HttpPutFormContentFilter.java:108)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:81)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.cloud.sleuth.instrument.web.TraceFilter.doFilter(TraceFilter.java:186)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:197)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.boot.actuate.autoconfigure.MetricsFilter.doFilterInternal(MetricsFilter.java:106)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:478)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:799)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1457)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x00000000ff519ab0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00000000ff4e63e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

http-nio-8080-exec线程

熟悉tomcat的可以跳过本段

往调用链的源头看,可以猜到这个线程是tomcat用的,大概率是线程池。

我们来看一下源码

spring-boot内嵌的tomcat8默认使用nio连接处理请求,NioEndpoint在初始化时创建了线程池,所以个人认为这个线程池应该称之为worker池更为贴切。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
/**
* Start the NIO endpoint, creating acceptor, poller threads.
*/
@Override
public class NioEndpoint extends AbstractJsseEndpoint<NioChannel> {
...
public void startInternal() throws Exception {

if (!running) {
...
// Create worker collection
if ( getExecutor() == null ) {
createExecutor();
}
}
}
}

AbstractEndpoint中实现了连接池的创建,可以看到线程名称前缀为getName()+’-exec-‘,这个getName()的代码就不贴了,结果就是http-nio-8080-这个前缀。

1
2
3
4
5
6
7
8
9
10
public abstract class AbstractEndpoint<S> {
...
public void createExecutor() {
internalExecutor = true;
TaskQueue taskqueue = new TaskQueue();
TaskThreadFactory tf = new TaskThreadFactory(getName() + "-exec-", daemon, getThreadPriority());
executor = new ThreadPoolExecutor(getMinSpareThreads(), getMaxThreads(), 60, TimeUnit.SECONDS,taskqueue, tf);
taskqueue.setParent( (ThreadPoolExecutor) executor);
}
}

这个线程主要的作用是处理socket连接:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
public boolean processSocket(SocketWrapperBase<S> socketWrapper,
SocketEvent event, boolean dispatch) {
try {
...
SocketProcessorBase<S> sc = processorCache.pop();
if (sc == null) {
sc = createSocketProcessor(socketWrapper, event);
} else {
sc.reset(socketWrapper, event);
}
Executor executor = getExecutor();
if (dispatch && executor != null) {
executor.execute(sc);
} else {
sc.run();
}
}
...
return true;
}

这个线程池的maxThreads默认是200,与线程栈中200个waiting线程也对应上了。

所以此处可以断定,是tomcat的线程池用完了,无法处理新的请求。

线程stack详细分析

观察http-nio-8080-exec线程调用栈的顶部发现,线程是堵塞在获取redis获取连接池的方法了。关于连接池的实现原理可以参考ApacheHttpClient连接池源码分析

查看服务的配置可以看到,redis连接池为100,为什么会出现一个连接都取不到,而所有的线程都在等待Redis连接的情况呢?

大胆猜测可能是有代码在borrow连接之后,没有release,造成连接越来越少,最终连接池枯竭。

查看代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
public class CastleRedisClient {

private JedisPoolManager jedisPoolManager;

public static interface RedisCallback<T> {
T callback(Jedis jedis);
}

public <T> T callback(CastleRedisClient.RedisCallback<T> callback) {
Jedis jedis = null;
T result = null;
try {
jedis = jedisPoolManager.getJedis();
result = callback.callback(jedis);
} catch (Exception e) {
log.error("RedisException:", e);
} finally {
jedisPoolManager.close(jedis);
}
return result;
}

public Object get(final Serializable key) {
Object ob = callback(new CastleRedisClient.RedisCallback<Object>() {
@Override
public Object callback(Jedis jedis) {
byte[] result = jedis.get(SerializeUtil.serialize(key));
return null == result ? result : SerializeUtil.deserialize(result);
}
});

return ob;
}
...

public JedisPoolManager getJedisPoolManager() {
return jedisPoolManager;
}

public void setJedisPoolManager(JedisPoolManager jedisPoolManager) {
this.jedisPoolManager = jedisPoolManager;
}
}

这个client使用了模板方法设计模式,将callback作为抽象方法留给子类实现,这样就保证了主流程:取连接–>查数据–>释放连接 的公用流程的复用。

但是,这里不恰当地将本应该封装在内部的JedisPoolManager对象通过getter方法逸出了,造成了有代码(确实有,这里就不贴了)没有按照上面的流程取到了连接而没有释放,最终造成了连接池枯竭,影响了整个服务的可用性。

超时时间

这里插播一条插曲,在排查过程中,发现明明设置了redis超时时间,为什么线程没有进入timed_waiting状态,而进入了无时间限制的waiting

通过阅读源码发现,这个超时时间,其实是设置的redis连接的超时时间,而不是从连接池取连接的最大等待时间(这个时间默认值是-1,即无限等待)。开发过程中有很多超时时间,切忌想当然地理解其作用。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
public void connect() {
if (!isConnected()) {
try {
socket = new Socket();
// ->@wjw_add
socket.setReuseAddress(true);
socket.setKeepAlive(true); // Will monitor the TCP connection is
// valid
socket.setTcpNoDelay(true); // Socket buffer Whetherclosed, to
// ensure timely delivery of data
socket.setSoLinger(true, 0); // Control calls close () method,
// the underlying socket is closed
// immediately
// <-@wjw_add

socket.connect(new InetSocketAddress(host, port), connectionTimeout);
socket.setSoTimeout(soTimeout);
...
}
}
}

总结

这次排查是个很典型的线上问题,所以在这里啰嗦地总结了几点:

  1. 服务监控的重要性
  2. 合理使用Java调试工具
  3. 一个不严谨封装的代码,很可能会带来严重的后果