# Interface alarms and inspection guidelines
# Summary
The WeChat public platform provides interface alarms. When the number of failed message pushes from the WeChat server to a developer reaches a preset threshold, an alarm message is sent to the designated WeChat alarm group (configured at: public platform -> Development -> Operation and Maintenance -> Interface Alarm). Please watch these alarms actively and resolve problems promptly to improve the Service Account's quality of service.
To locate problems more effectively from the sample at the end of the alarm message (which provides the openid and timestamp), developers should add detailed logs containing key information at each layer, such as the access layer and the logic layer, so that problems can be located quickly.
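As a rough illustration of per-layer logging, the sketch below wraps the work done in each layer and logs the openid, the message timestamp, and the elapsed time. The logger name `wechat_push`, the helper `traced`, the layer names, and the sample openid and timestamp are all hypothetical; a real service would plug this into its own logging setup.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("wechat_push")  # hypothetical logger name


@contextmanager
def traced(layer, openid, create_time):
    """Log start, end, and elapsed time for one request in one layer."""
    start = time.monotonic()
    log.info("layer=%s openid=%s create_time=%s stage=start",
             layer, openid, create_time)
    try:
        yield
    except Exception:
        log.exception("layer=%s openid=%s create_time=%s stage=error",
                      layer, openid, create_time)
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        log.info("layer=%s openid=%s create_time=%s stage=end elapsed_ms=%.1f",
                 layer, openid, create_time, elapsed_ms)


# Illustrative usage with made-up values: wrap the work done in each layer.
with traced("access", "o_example_openid", 1417435920):
    with traced("logic", "o_example_openid", 1417435920):
        pass  # business logic would run here
```

With logs in this shape, the openid and time from an alarm sample can be searched for directly in each layer's log.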
# Interface alarm
There are currently two types of alarms:
# Universal alarm
All developers need to pay attention.
Type | Description |
---|---|
DNS failure | DNS resolution failed when the WeChat server pushed a message or event to the Service Account |
DNS timeout | DNS resolution timed out when the WeChat server pushed a message or event to the Service Account; the timeout is 5 seconds |
Connection timeout | The WeChat server timed out while connecting to the Service Account developer's server; the timeout is 5 seconds |
Request timeout | After the WeChat server pushed a message or event to the Service Account, the developer did not return within 5 seconds |
Response failed | The WeChat server received an invalid response after pushing a message or event to the Service Account |
MarkFail (automatic blocking) | After several failed pushes of messages or events to the Service Account, the WeChat server temporarily stops pushing messages and unblocks after one minute |
# Service Account Third Party Platform Alarm
You need to pay attention to these alarms only if you have registered as a Service Account Third Party Platform developer on the WeChat open platform (open.weixin.qq.com).
Type | Description |
---|---|
Push component_verify_ticket timeout | When component_verify_ticket was pushed, the developer did not return within 5 seconds |
Failed to push component_verify_ticket | The developer did not return success when component_verify_ticket was pushed |
Push Third Party Platform message timeout | When a Third Party Platform message (such as a deauthorization message) was pushed, the third-party platform did not return within 5 seconds |
Failed to push Third Party Platform message | When a Third Party Platform message (such as a deauthorization message) was pushed, the third-party platform did not return success |
# Explanation of the alarm contents
Fields in the alarm message:
- AppID: the Service Account's appid
- Nickname: the Service Account's nickname
- Time: the time the anomaly first occurred (for example, the time of the first timeout or of the first failed response)
- Content: a specific description of the error
- Number of times: the number of failures that occurred
- Error sample: information that helps locate the problem, such as the developer IP and pushed message type of the first timeout. For a failed response, the error sample also includes the content the developer returned in the first failed response.
In general, the IP, time, and message type provided in the alarm are enough to locate the cause of the problem on the developer's side fairly quickly.
# Type of alarm
Specific alarm examples and troubleshooting guidelines are given below.
# Timeout alarm
Appid: wxxxxxx
Nickname: WxNickName
Time: 2014-12-01 20:12:00
Content: After the WeChat server pushed a message or event to the Service Account, the developer did not return within 5 seconds
Number of times: 1272 in 5 minutes
Error sample: [IP=203.205.140.29][Event=UnSubscribe]
This alarm indicates that when the WeChat server pushed an unfollow event to the developer, the developer did not return a result within 5 seconds. This happened 1272 times in the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00. The first timeout in this window occurred at 2014-12-01 20:12:00, the developer's IP is 203.205.140.29, and the event type is unfollow.
# Response failed alarm
Appid: wxxxx
Nickname: WxNickName
Time: 2014-12-01 20:12:00
Content: After the WeChat server pushed a message or event to the Service Account, the response it received was invalid
Number of times: 1320 in 5 minutes
Error sample: [Event=Click] [ip=58.248.9.218][response_length=10][response_content=Error 500:]
This alarm indicates that when the WeChat server pushed a custom menu click event to the developer, the developer returned an invalid result. This happened 1320 times in the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00. The first failed response in this window occurred at 2014-12-01 20:12:00, the developer's IP is 58.248.9.218, the event type is a menu click event, and the developer's response was 10 bytes long with the content "Error 500:".
# Connection timeout
Appid: wxxxx
Nickname: WxNickName
Time: 2015-02-04 20:13:09
Content: The WeChat server timed out while connecting to the Service Account developer's server; the timeout is 5 seconds
Number of times: 7289 in 5 minutes
Error sample: [IP=180.150.190.135][Msg=Text]
This alarm indicates that when the WeChat server tried to push a text message sent by a follower to the developer, it could not connect to the server address configured by the developer. This happened 7289 times in the 5 minutes from 2015-02-04 20:13:09 to 2015-02-04 20:18:00. The first connection timeout in this window occurred at 2015-02-04 20:13:09, the developer's IP is 180.150.190.135, and the message type is a text message sent by a user.
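The error samples in the three alarms share a `[Key=Value]` layout, so they can be split into fields mechanically before searching logs. The following is a minimal sketch under that assumption; the function name `parse_error_sample` is hypothetical, and the format is inferred only from the samples shown here, not from any published specification.

```python
import re

# Matches one "[Key=Value]" group, as seen in the alarm samples above.
_FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_error_sample(sample):
    """Return the bracketed fields as a dict with lower-cased keys."""
    return {key.strip().lower(): value.strip()
            for key, value in _FIELD.findall(sample)}


print(parse_error_sample("[IP=203.205.140.29][Event=UnSubscribe]"))
print(parse_error_sample(
    "[Event=Click] [ip=58.248.9.218][response_length=10]"
    "[response_content=Error 500:]"
))
```

Keys are lower-cased because the samples use both `IP` and `ip`. A response body that itself contains `]` would be cut short by this simple pattern.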
# Diagnostic guidelines
# DNS Failure
This error means the WeChat server failed to resolve DNS while pushing a message to the developer. If you encounter this alarm, please confirm:
- Whether the URL or domain name is configured incorrectly;
- Whether the domain name has changed, for example expired or been updated.
If neither of these is the cause, please contact the WeChat public platform. To troubleshoot:
- Ping the domain name in the URL configured on the public platform (MP) to confirm that it resolves to the correct IP. If it does not resolve or returns an error, check the configuration in your domain name hosting management system.
- If the previous step returns the correct IP but DNS failure alarms continue, retest using the DNS server 182.254.116.116. On Linux, run dig @182.254.116.116 followed by the domain name. On Windows, change the DNS server address in the network configuration and then ping the domain name. If the IP returned is incorrect or no IP is returned, please contact the WeChat team.
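The first check (does the configured domain resolve, and to which IPs) can also be scripted. The sketch below extracts the hostname from a callback URL and resolves it with the operating system's resolver through `socket.getaddrinfo`. It does not query 182.254.116.116 specifically, so the `dig` command above is still needed for that step. The function name `resolve_callback_host` and its `resolver` parameter are hypothetical.

```python
import socket
from urllib.parse import urlparse


def resolve_callback_host(url, resolver=socket.getaddrinfo):
    """Return (hostname, sorted list of IPs) for the host in `url`.

    Raises socket.gaierror if the system resolver cannot resolve the name.
    The `resolver` argument exists only so the lookup can be replaced in tests.
    """
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    infos = resolver(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return host, sorted({info[4][0] for info in infos})
```

Calling `resolve_callback_host` with the URL configured on the public platform returns the addresses your own environment sees; compare them with the IP reported in the alarm.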
# DNS timeout
This error does not currently occur.
# Connection timeout
This error means the WeChat server failed to connect to the developer server within 3 seconds. The alarm message provides the time of the first connection failure and the IP it tried to connect to. If this alarm occurs, please confirm:
- Whether the IP is wrong.
- Whether the machine at this IP is overloaded with too many connections.
- If a third party hosts the server, whether the hosting service is faulty.
- Whether there is a problem with the network operator.
- If a network policy such as a firewall is in place, whether the WeChat server IPs are whitelisted. See the WeChat server IP address documentation for the addresses.
- Whether the network is unreachable; this can be checked with network detection.
To troubleshoot:
- Check for network environment problems. Use the interface for obtaining WeChat push server IPs to get the IPs of the WeChat callback servers, then ping them from your server to check the network quality between your server and the WeChat callback servers. If there are network problems, please contact your server provider to resolve them.
- Check the access-layer server's connection count, load, and nginx configuration (the number of connections allowed). Check whether the nginx error log contains "Connection reset by peer" or "Connection timed out" entries; if it does, nginx is overloaded with connections.
- It is recommended to build test tools that perform heartbeat checks on the system and that monitor and alert on system load, connection count, request count, and processing time in real time.
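A heartbeat check can start as a timed TCP connect to the access layer. The sketch below measures how long a connection takes and returns `None` when it fails or times out. The function name `tcp_connect_ms` is hypothetical, and the 3-second default only mirrors the connection limit mentioned above; it is an assumption, not a recommended value.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000
    except OSError:  # covers refused connections, unreachable hosts, and timeouts
        return None
```

Run periodically from a machine outside your own network, a probe like this gives an early signal before connection timeout alarms accumulate.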
For nginx configuration, see the official documentation at http://nginx.org/en/docs/, focusing on connection configuration and log configuration. Some important nginx configuration examples follow (log_format is shown inside the http block, where nginx requires it):
worker_processes 16; # number of CPU cores
error_log logs/error.log info; # error log
worker_rlimit_nofile 102400; # maximum number of open file handles
events {
    worker_connections 102400; # maximum number of connections allowed
}
http {
    # Request log. Key fields: request_time is the total request time, upstream_response_time is the backend processing time.
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" "$host" "$cookie_ssl_edition" '
                    '"$upstream_addr" "$upstream_status" "$request_time" '
                    '"$upstream_response_time" ';
    access_log logs/access.log main;
}
# Request timeout
The WeChat server pushed a message or event to the developer server, and the developer did not return within 5 seconds. When a request timeout occurs, the alarm message provides the time of the first timeout, the developer IP, and the message type. Please confirm:
- Whether this IP is wrong
- Whether this IP received a request of the message type given in the alarm message
- Whether the request takes too long to process
Solution:
Each module needs complete logs that record the time each request spends in that module. Combined with the information provided in the WeChat alarm, this makes it easy to locate the faulty server. Common causes are:
- Machine load is too high, so processing time increases
- Processing exceptions on the machine, such as lost messages
- Machine failure
For processing exceptions, fix the bug as soon as possible. For machine failures, take the faulty machine out of service as soon as possible. For high machine load, two feasible solutions are outlined briefly here.
Option 1: Optimize performance and scale out. Check the load (CPU, memory, IO, network; see the Common tools section) and apply the optimization that fits the specific bottleneck.
Option 2: Asynchronous processing. If messages pushed by the WeChat server cannot be processed in real time, store the message first and return success to the WeChat server immediately. The backend can then process the message afterwards. If a reply to the user is needed, call the customer service message API to send it.
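Option 2 can be sketched with a queue and a background worker. This is only an outline under stated assumptions: the in-memory `queue.Queue` stands in for whatever durable store you use, `handle_push` and `process` are hypothetical names, and the literal reply `"success"` is assumed from the wording above, so check it against the message-reply documentation.

```python
import queue
import threading

pending = queue.Queue()  # stand-in for a durable store (assumption)


def handle_push(raw_body):
    """Called by the web layer for each push: store the message, reply at once."""
    pending.put(raw_body)
    return "success"  # assumed acknowledgement body


def process(raw_body):
    """Placeholder for the slow business logic (hypothetical)."""
    print("processed", len(raw_body), "bytes")


def worker():
    """Drain the queue in the background, one message at a time."""
    while True:
        raw_body = pending.get()
        try:
            process(raw_body)
        except Exception as exc:  # keep the worker alive on bad messages
            print("processing failed:", exc)
        finally:
            pending.task_done()


threading.Thread(target=worker, daemon=True).start()

print(handle_push(b"<xml>...</xml>"))
pending.join()  # demo only: wait for the background worker to finish
```

The request path does nothing slower than an enqueue, so the response time stays far below the 5-second limit regardless of how long `process` takes.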
# Response failed
If the developer does not respond in the response message format specified in the documentation, or if a network error occurs, a response failure alarm is raised. The alarm message provides the time of the first failed response, the developer's IP, the message type, and the content of the response. Please confirm:
- Whether this IP is wrong
- Whether there is a network error on this IP
- Whether the business logic failed to respond in the format specified in the documentation, or entered exception-handling logic
# MarkFail (automatic blocking)
The WeChat backend counts each developer's failures in real time. When a large number of pushes to a developer fail, the WeChat server automatically blocks the developer, pushes no messages for 1 minute, and sends an alarm to the WeChat group. This is the highest-level alarm. Developers should fix the backend failure and restore service as soon as they receive it. Before this alarm arrives, developers will already have received alarms such as connection timeout, request timeout, or response failure. Resolving those promptly avoids being blocked by the WeChat server, which seriously affects the Service Account's service.
# Push component_verify_ticket timeout
Only Service Account Third Party Platform developers receive this alarm; other Service Account developers can ignore it. Because a third-party platform serves many Service Accounts, its quality of service is subject to stricter requirements and alarms, so these four special events are alarmed separately. To apply for and develop a Service Account Third Party Platform, go to the WeChat open platform (open.weixin.qq.com).
# Failed to push component_verify_ticket
The notes under "Push component_verify_ticket timeout" apply to this alarm as well.
# Push Third Party Platform message timeout
The notes under "Push component_verify_ticket timeout" apply to this alarm as well.
# Failed to push Third Party Platform message
The notes under "Push component_verify_ticket timeout" apply to this alarm as well.
# Common tools
The following is a brief introduction to common tools for viewing server performance and load. For details on how to use each tool, consult its own documentation.
# View CPU load
- uptime: shows the overall load of the server. The system load is the average length of the run queue over the past 1, 5, and 15 minutes, which should normally be less than the number of CPUs.
- vmstat: short for Virtual Memory Statistics. It monitors the operating system's virtual memory, processes, and CPU activity and reports statistics for the system as a whole. A common invocation is vmstat 5 5 (sample every 5 seconds, five times), which yields a summary that reflects the real state of the system.
- top: one of the most popular Unix/Linux performance tools. Administrators can run top to monitor processes and overall Linux performance.
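The rule of thumb for uptime (load should stay below the number of CPUs) is easy to check from a monitoring script. The sketch below uses `os.getloadavg()`, which is available on Unix systems only; the function name `load_report` and the "overloaded" criterion (1-minute load at or above the CPU count) are illustrative assumptions.

```python
import os


def load_report():
    """Compare the 1/5/15-minute load averages with the CPU count."""
    cpus = os.cpu_count() or 1
    one, five, fifteen = os.getloadavg()  # Unix only
    return {
        "cpus": cpus,
        "load": (one, five, fifteen),
        "overloaded": one >= cpus,  # illustrative threshold
    }


print(load_report())
```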
# View memory load
- free: shows current system memory usage on Linux, including free and used physical memory and swap, as well as shared memory and the buffers used by the kernel.
# View network load
- netstat: a console command that is very useful for monitoring TCP/IP networks. It displays routing tables, active network connections, and status information for each network interface. netstat shows statistics for the IP, TCP, UDP, and ICMP protocols and is generally used to check the network connections on each port of the machine.
- sar: sar (System Activity Reporter) is one of the most comprehensive system performance analysis tools available on Linux. It can report on many aspects of system activity, including file reads and writes, system call usage, disk I/O, CPU efficiency, memory usage, process activity, and IPC-related activity. CentOS 6.3 x64 is the example system assumed for the sar command.
# View disk load
- iostat: reports CPU statistics and input/output statistics for the entire system, adapters, tty devices, disks, and CD-ROMs on Linux.
# Nginx Configuration and Troubleshooting Guidelines
When connection timeout or slow response alarms occur, the nginx side can be checked as follows. First check the request log with tail -f logs/access.log and look at the upstream_status field:
- 200: normal.
- 502 / 503 / 504: slow processing, or a backend machine is down. Then check upstream_response_time: if the response time is really slow (hundreds of milliseconds or more), the backend service has a problem.
- 404: the requested path does not exist or is incorrect, so the file cannot be found. Check that the URL path configured on the public platform is correct and that the file or program exists on the server.
- 403: access is not permitted. Check whether nginx.conf contains special access restrictions.
- 499: this error is rare (nginx logs 499 when the client closes the connection before a response is sent). Please contact the WeChat team.
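The same check can be automated over the access log. The sketch below relies on the `main` log format shown in the nginx configuration section, where the last three quoted fields are `$upstream_status`, `$request_time`, and `$upstream_response_time`. The function names, the 0.5-second threshold, and the sample log line are made up for illustration.

```python
import re

_QUOTED = re.compile(r'"([^"]*)"')


def parse_access_line(line):
    """Extract the last three quoted fields of the `main` log format."""
    fields = _QUOTED.findall(line)
    if len(fields) < 3:
        return None
    upstream_status, request_time, upstream_response_time = fields[-3:]
    return {
        "upstream_status": upstream_status,
        "request_time": float(request_time),
        # Kept as text: nginx may log "-" or several comma-separated values here.
        "upstream_response_time": upstream_response_time,
    }


def is_suspect(entry, slow_seconds=0.5):
    """Flag non-200 upstream statuses or slow requests (threshold is an assumption)."""
    return entry["upstream_status"] != "200" or entry["request_time"] >= slow_seconds


# Made-up log line in the `main` format.
sample = (
    '203.0.113.7 - - [01/Dec/2014:20:12:00 +0800] "POST /wechat HTTP/1.1" '
    '502 0 "-" "Mozilla/4.0" "-" "example.com" "-" '
    '"127.0.0.1:8080" "502" "5.001" "5.001" '
)
entry = parse_access_line(sample)
print(entry, is_suspect(entry))
```

Filtering the log this way around the time given in the alarm narrows the search to the requests that actually failed or ran slowly.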
Check the error log with tail -f logs/error.log for entries such as connect() failed, Connection refused, or Connection reset by peer. If they appear, nginx may be overloaded with connections, among other possible causes.
Check the number of network connections on the system to see whether the count is unusually large:
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
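CLOSED // no connection is active or pending
LISTEN // the server is waiting for an incoming call
SYN_RECV // a connection request has arrived; waiting for acknowledgment
SYN_SENT // the application has started to open a connection
ESTABLISHED // normal data transfer state (the current number of concurrent connections)
FIN_WAIT1 // the application has said it is finished
FIN_WAIT2 // the other side has agreed to release
TIME_WAIT // waiting for all packets to die off
CLOSING // both sides have tried to close simultaneously
CLOSE_WAIT // the other side has initiated a release
LAST_ACK // waiting for the final acknowledgment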
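For a monitoring script, the awk tally above can be reproduced in Python over captured `netstat -n` output. This is a sketch of the same counting logic; the function name `count_tcp_states` is hypothetical and the sample text is made up.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally the last column of every line that starts with "tcp"."""
    states = Counter()
    for line in netstat_output.splitlines():
        if line.startswith("tcp"):
            states[line.split()[-1]] += 1
    return states


# Made-up `netstat -n` excerpt for illustration.
sample = """\
Proto Recv-Q Send-Q Local Address     Foreign Address      State
tcp        0      0 10.0.0.5:80       203.0.113.7:51234    ESTABLISHED
tcp        0      0 10.0.0.5:80       203.0.113.8:40112    TIME_WAIT
tcp        0      0 10.0.0.5:80       203.0.113.9:40113    TIME_WAIT
"""
print(count_tcp_states(sample))
```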
Check the system's file handle limit with ulimit -n and confirm that it is not too small (smaller than the number of requests).
Check whether the worker_rlimit_nofile and worker_connections configuration items are too small (smaller than the number of requests).
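The handle limit can also be read programmatically with the standard `resource` module (Unix only). Note that this reports the limit of the Python process itself, which matches `ulimit -n` only when the script runs as the same user and from the same environment as nginx. The function name `nofile_headroom` and the figure passed to it are illustrative.

```python
import resource  # Unix only


def nofile_headroom(expected_connections):
    """Return (soft_limit, ok) for this process's open-file limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    ok = soft == resource.RLIM_INFINITY or soft > expected_connections
    return soft, ok


print(nofile_headroom(1024))  # 1024 is an arbitrary example figure
```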