Contents
1 Overview
2 Description of Alarm Content
3 Methods for Troubleshooting Various Types of Alarms
4 FAQ
5 Appendix
5.1 Appendix 1: List of Messages and Events Pushed by Weixin and Response Format
5.2 Appendix 2: Common Tools for Viewing Server Performance Load
5.3 Appendix 3: nginx Configuration and Troubleshooting Guide
# Overview
The Weixin Official Accounts Platform provides an API alarm feature. When the Weixin server fails to push messages to the developer a preset number of times, it sends an alarm message to the designated Weixin alarm group (to set the group, go to Official Accounts Platform > Development > O&M Center > API Alarm). Developers should watch for alarms and resolve the underlying problems promptly to maintain the service quality of their Official Accounts.
Each alarm message ends with an instance of the failure that includes an openid and a timestamp. To troubleshoot from that instance and locate the problem quickly, the developer needs to write detailed logs containing this key information at every layer, including the access layer and the logic layer.
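As a minimal sketch of such logging, the Python snippet below pulls the openid and timestamp out of a pushed message and builds one log line per layer. It assumes the push arrives as plaintext XML with the FromUserName, CreateTime, MsgType, and Event elements (an encrypted push would need to be decrypted first). The function names, logger name, and log layout are illustrative and not part of the Weixin API.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("weixin.push")  # hypothetical logger name


def extract_log_fields(xml_body: str) -> dict:
    """Extract the fields needed to match a request to an alarm instance."""
    root = ET.fromstring(xml_body)
    return {
        "openid": root.findtext("FromUserName", default=""),
        "create_time": root.findtext("CreateTime", default=""),
        "msg_type": root.findtext("MsgType", default=""),
        "event": root.findtext("Event", default=""),
    }


def log_push(layer: str, xml_body: str, elapsed_ms: float) -> str:
    """Log one line for the given layer and return it."""
    f = extract_log_fields(xml_body)
    line = (
        f"layer={layer} openid={f['openid']} create_time={f['create_time']} "
        f"msg_type={f['msg_type']} event={f['event']} elapsed_ms={elapsed_ms:.1f}"
    )
    logger.info(line)
    return line
```

Calling log_push once in the access layer and once in the logic layer, with the time spent in each, makes it possible to search the logs by the openid and timestamp given in the alarm.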
There are 2 types of alarms:
- General alarms, which all developers should pay attention to.
Type | Description |
---|---|
DNS failure | DNS resolution failed when the Weixin server pushed a message or event to the Official Account. |
DNS timeout | DNS resolution timed out when the Weixin server pushed a message or event to the Official Account. The timeout time is 5 seconds. |
Connection timeout | Timeout occurred when the Weixin server connected to the Official Account developer's server. The timeout time is 5 seconds. |
Request timeout | The developer did not respond within 5 seconds after the Weixin server pushed a message or event to the Official Account. |
Response failure | The response obtained after the Weixin server pushed a message or event to the Official Account was invalid. |
MarkFail (Auto blocking) | After the Weixin server failed to push messages or events to the Official Account a certain number of times, it stopped pushing messages temporarily and resumed after one minute. |
- Official Accounts third-party platform alarms, which concern those who have applied to become developers of an Official Accounts third-party platform on the Weixin Open Platform (open.weixin.qq.com).
Type | Description |
---|---|
Pushing component_verify_ticket timed out | The developer did not respond within 5 seconds after component_verify_ticket was pushed. |
Pushing component_verify_ticket failed | The developer did not return success after component_verify_ticket was pushed. |
Pushing a message to the third-party platform timed out | The third-party platform did not respond within 5 seconds after a message (such as canceling authorization) was pushed to the third-party platform. |
Pushing a message to the third-party platform failed | The third-party platform did not return success after a message (such as canceling authorization) was pushed to the third-party platform. |
The following sections give examples of some alarms and a troubleshooting guide.
# Description of Alarm Content
Each alarm message contains the following fields:
a) appid: Official Account appid
b) Name: Official Account name
c) Time: For all alarms, the time when the first exception occurred is provided (such as the time when the first timeout occurred, and the time when the first response failure occurred).
d) Content: Detailed description of the error
e) Frequency: Number of exceptions that occurred
f) Error example: Information useful for locating the problem, such as the developer's IP and the type of the message pushed at the first timeout. For response failures, the error example also includes the developer's response packet from the first response failure.
Generally, the IP, time, and message type provided in the alarm are enough to quickly locate the cause of the problem on the developer's side.
Alarm example 1: Timeout alarm
Appid: wxxxxxx
Name: WxNickName
Time: 2014-12-01 20:12:00
Content: The developer did not respond within 5 seconds after the Weixin server pushed a message or event to the Official Account.
Frequency: 1272 times in 5 minutes
Error example: [IP=203.205.140.29][Event=UnSubscribe]
The alarm indicates that the developer did not return any result within 5 seconds after the Weixin server pushed an unfollowing event to the developer. During the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00, the problem occurred 1272 times. The time of the first timeout was 2014-12-01 20:12:00, the developer's IP was 203.205.140.29, and the event was an unfollowing event.
Alarm example 2: Response failure
Appid: wxxxx
Name: WxNickName
Time: 2014-12-01 20:12:00
Content: The response obtained after the Weixin server pushed a message or event to the Official Account was invalid.
Frequency: 1320 times in 5 minutes
Error example: [Event=Click] [ip=58.248.9.218][response_length=10][response_content=Error 500:]
The alarm indicates that the developer returned an invalid result after the Weixin server pushed a custom menu tapping event to the developer. During the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00, the problem occurred 1320 times. The time of the first response failure was 2014-12-01 20:12:00, the developer's IP was 58.248.9.218, the event was a menu tapping event, and the content returned by the third party was "Error 500:" with a length of 10 bytes.
Alarm example 3: Connection timeout
Appid: wxxxx
Name: WxNickName
Time: 2015-02-04 20:13:09
Content: Timeout occurred when the Weixin server connected to the Official Account developer's server. The timeout time is 5 seconds.
Frequency: 7289 times in 5 minutes
Error example: [IP=180.150.190.135][Msg=Text]
The alarm indicates that the Weixin server was unable to connect to the server address entered by the developer when pushing a text message sent by a follower to the developer. During the 5 minutes from 2015-02-04 20:13:09 to 2015-02-04 20:18:09, the problem occurred 7289 times. The time of the first connection timeout was 2015-02-04 20:13:09, the developer's IP was 180.150.190.135, and the message was a text message sent by a user.
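Since the error example is a sequence of bracketed key=value pairs, it can be parsed mechanically before searching the logs. The following is a small sketch based only on the three examples above; the function name is hypothetical, and the parsing would cut a value short if the response content itself contained a closing bracket.

```python
import re

# One "[key=value]" group; keys are normalized to lower case because the
# examples above use both "IP" and "ip".
FIELD_RE = re.compile(r"\[\s*([^=\]\s]+)\s*=([^\]]*)\]")


def parse_error_example(text: str) -> dict:
    """Turn '[IP=1.2.3.4][Event=Click]' into {'ip': '1.2.3.4', 'event': 'Click'}."""
    return {key.lower(): value.strip() for key, value in FIELD_RE.findall(text)}


if __name__ == "__main__":
    example = "[Event=Click] [ip=58.248.9.218][response_length=10][response_content=Error 500:]"
    print(parse_error_example(example))
```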
# Methods for Troubleshooting Various Types of Alarms
1. DNS failure
This error means that the Weixin server failed to resolve the developer's domain name when it pushed a message to the developer. In case of this alarm, confirm:
a) whether the URL or the domain name entered was correct.
b) whether the domain name changed, expired, or was updated.
If the error is not caused by the above two reasons, contact the Weixin Official Accounts Platform.
2. DNS timeout
This error does not occur currently.
3. Connection timeout
This error means that the Weixin server failed to connect to the developer's server within 5 seconds. The alarm message provides the time when the connection failed for the first time and the IP for connection. In case of this alarm, confirm:
a) whether the IP was correct.
b) whether the server of the IP was overloaded and had too many connections.
c) whether the hosting provider failed, if server hosting is provided by a third party.
d) whether the network operator failed.
e) whether network policies such as a firewall were set. A whitelist of Weixin server IPs can be created. For more information, see [Getting Weixin Server IP Address](/doc/offiaccount/Basic_Information/Get_the_Weixin_server_IP_address.html).
f) whether the network was unreachable, which can be checked according to [Network Detection](/doc/offiaccount/Basic_Information/Network_Detection.html).
4. Request timeout
This error means that the developer did not respond within 5 seconds after the Weixin server pushed a message or event to the developer's server. The alarm message provides the time when the request first timed out, the developer's IP, and the message type. In case of this alarm, confirm:
a) whether the IP was correct.
b) whether the IP received a request of the message type given by the alarm message.
c) whether processing the request took too long.
5. Response failure
This error means that the developer did not reply to the message in the reply message format specified in the wiki, or that a network error occurred. The alarm message provides the time when the response failed for the first time, the developer's IP, the message type, and the response message content. In case of this alarm, confirm:
a) whether the IP was correct.
b) whether a network error occurred on the server at that IP.
c) whether the business processing logic failed to reply to the message as per the wiki specification, or took an incorrect code path.
6. MarkFail (Auto blocking)
The Weixin backend records the developer's failures in real time. When the Weixin server fails to push messages to the developer many times, it automatically blocks the developer, stops pushing any messages to the developer for 1 minute, and sends an alarm to the Weixin alarm group. This is the highest-level alarm: on receiving it, the developer should fix the backend failure as soon as possible to resume service. A MarkFail alarm is always preceded by other alarms such as connection timeout, request timeout, or response failure. Resolving those earlier alarms immediately avoids being blocked by the Weixin server, which would seriously affect the Official Account service.
7. Pushing component_verify_ticket timed out & 8. Pushing component_verify_ticket failed & 9. Pushing component message timed out & 10. Pushing component message failed
These four alarms are received only by developers of Official Accounts third-party platforms. Because a third-party platform serves many Official Accounts, higher requirements are set for its service quality, and these four events are therefore alarmed separately. The troubleshooting methods are the same as those provided in 4 and 5. For more information on applying for and developing Official Accounts third-party platforms, go to the Weixin Open Platform (open.weixin.qq.com).
# FAQ
1. How do I troubleshoot DNS failures?
1.1 Ping the domain name in the URL configured on your MP to see whether it resolves to the correct IP. If no IP or an incorrect IP is returned, check the configuration in your domain name hosting provider's management system.
1.2 If the correct IP is obtained in step 1.1 but the DNS failure alarm still occurs, repeat the test using the DNS server 182.254.116.116. On Linux, run the dig test in the format dig @182.254.116.116 followed by the domain name. On Windows, change the DNS server address in the network configuration and then ping the domain name. If you get an incorrect IP or no IP, contact the Weixin Team.
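The first check (resolving the domain name and comparing the result with the expected IP) can also be scripted. The sketch below uses the system resolver through Python's socket module; the function names are hypothetical. It does not cover the test against 182.254.116.116, because querying a specific DNS server requires dig or a third-party resolver library.

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def check_dns(hostname: str, expected_ip: str, resolver=resolve_ipv4) -> str:
    """Return a one-line verdict comparing the resolved IPs to the expected one."""
    try:
        ips = resolver(hostname)
    except socket.gaierror as exc:
        return f"DNS failure for {hostname}: {exc}"
    if expected_ip not in ips:
        return f"MISMATCH for {hostname}: expected {expected_ip}, got {ips}"
    return f"OK: {hostname} resolves to {expected_ip}"
```

The resolver argument exists only so the check can be exercised without network access; in normal use it is left at its default.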
2. How do I solve the connection timeout problem?
2.1 Check for network environment issues.
(1) Use the Official Accounts Platform API to get the IP of the Weixin callback server. See https://api.weixin.qq.com/cgi-bin/getcallbackip?access_token=ACCESS_TOKEN.
(2) Perform the ping test on your service to check the network quality from your server to the Weixin callback server. If there is any network issue, contact your server provider.
2.2 On the access layer server, check the number of connections, the load, the nginx configuration, and the number of connections allowed. Check the nginx error logs for "Connection reset by peer" or "Connection timed out" entries. If there are any, the number of connections exceeds what nginx can handle.
2.3 It is recommended to build a test tool that performs a heartbeat check on the system, monitors the system load, number of connections, number of connections processed, and processing time in real time, and raises alarms.
For nginx configuration, see the official documentation (http://nginx.org/en/docs/), focusing on the connection and log settings. Here are examples of some important nginx settings:
    worker_processes 16;            # Number of CPU cores
    error_log logs/error.log info;  # Error log
    worker_rlimit_nofile 102400;    # Maximum number of open file handles per worker process
    events {
        worker_connections 102400;  # Maximum number of connections allowed per worker process
    }
    # Request log. Key fields: request_time - total time of the request,
    # upstream_response_time - backend processing time
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" "$host" "$cookie_ssl_edition" '
                    '"$upstream_addr" "$upstream_status" "$request_time" '
                    '"$upstream_response_time" ';
    access_log logs/access.log main;
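With the main log format above, the last four quoted fields of each access log line are the upstream address, upstream status, request time, and upstream response time. As a rough sketch, the Python snippet below extracts them and lists requests that took longer than a threshold. The function names and the sample values are made up for illustration, and the parsing assumes exactly the log_format shown above.

```python
import re

QUOTED_RE = re.compile(r'"([^"]*)"')


def _to_seconds(value: str):
    """Convert a time field to float; return None for '-' or multi-upstream lists."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_main_log_line(line: str):
    """Parse one line written with the 'main' log_format; None if it does not fit."""
    quoted = QUOTED_RE.findall(line)
    if len(quoted) < 4:
        return None
    upstream_addr, upstream_status, request_time, upstream_time = quoted[-4:]
    return {
        "remote_addr": line.split(" ", 1)[0],
        "upstream_addr": upstream_addr,
        "upstream_status": upstream_status,
        "request_time": _to_seconds(request_time),
        "upstream_response_time": _to_seconds(upstream_time),
    }


def slow_requests(lines, threshold_seconds: float = 5.0) -> list:
    """Return parsed entries whose total request time reached the threshold."""
    result = []
    for line in lines:
        entry = parse_main_log_line(line)
        if entry and entry["request_time"] is not None:
            if entry["request_time"] >= threshold_seconds:
                result.append(entry)
    return result
```

Filtering the output by the IP and time from the alarm shows whether the time was spent in nginx or in the backend (compare request_time with upstream_response_time).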
3. How do I solve the request timeout problem?
Each module requires a complete log which records the time consumed by each request in each module. By combining the time and the information provided by the Weixin alarm, you can easily locate the faulty server. Common reasons are:
1) Server overload, which leads to increased time consumed.
2) Server processing exception, which leads to message loss.
3) Server exception.
You should fix the bug in the case of a server processing exception, and take the faulty server out of service in the case of a server exception. For server overload, there are two feasible solutions:
Solution 1: Optimize performance and expand capacity. Check the load (CPU, memory, I/O, and network; see Appendix 2) and apply the optimization that matches the specific bottleneck.
Solution 2: Asynchronous processing. If a message pushed by the Weixin server cannot be processed in real time, store the message, return "success" to the Weixin server first, and process the message later at the backend. If you need to reply to the user's message, call the "Customer Service Messaging" API.
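The asynchronous approach in Solution 2 might look like the following sketch. It uses an in-process queue and a worker thread purely for illustration; handle_push, process_message, and the processed list are hypothetical names, and a real deployment would more likely store messages in a durable queue or database so that they survive a restart.

```python
import queue
import threading

pending = queue.Queue()   # messages accepted but not yet processed
processed = []            # stands in for the result of real business logic


def handle_push(xml_body: str) -> str:
    """Request handler: enqueue the message and answer immediately."""
    pending.put(xml_body)
    return "success"


def process_message(xml_body: str) -> None:
    """Hypothetical slow business logic, run outside the request path."""
    processed.append(xml_body)


def worker() -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        item = pending.get()
        try:
            if item is None:
                return
            process_message(item)
        finally:
            pending.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because handle_push only enqueues, its response time stays well under the 5-second limit regardless of how long process_message takes.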
4. How do I solve the access_token storage and usage problem?
Third parties often report service interruptions caused by the access_token. The Official Accounts Platform has found that most third parties refresh the access_token too frequently, so their refresh calls exceed the API frequency limit and the access_token becomes invalid. Here is a simple solution for access_token storage and usage.
1) The central control server calls the Weixin API at a regular interval (1 hour recommended) to refresh the access_token, and stores the new access_token in MySQL (or other storage).
2) Every time other working servers call the Weixin API, they obtain the access_token from MySQL (or other storage) and can keep it in memory for a while (1 minute recommended).
The Official Accounts Platform ensures that the old access_token can still be used within 5 minutes after the access_token is refreshed to avoid third party's failure to call the Weixin API while the access_token is updated.
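The two roles described above can be sketched as follows. TokenStore is an in-memory stand-in for the shared storage, fetch_token is a hypothetical callable that would wrap the actual call to the Weixin token API, and the class and function names are illustrative only.

```python
import time
from typing import Callable, Optional


class TokenStore:
    """Stand-in for the shared storage; holds the latest access_token."""

    def __init__(self) -> None:
        self._token: Optional[str] = None

    def save(self, token: str) -> None:
        self._token = token

    def load(self) -> Optional[str]:
        return self._token


def refresh_token(store: TokenStore, fetch_token: Callable[[], str]) -> str:
    """Run only on the central control server, on a timer (about once an hour)."""
    token = fetch_token()
    store.save(token)
    return token


class WorkerTokenCache:
    """Per-worker cache: reads the shared store at most once per ttl_seconds."""

    def __init__(self, store: TokenStore, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._store = store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[str] = None
        self._loaded_at = 0.0

    def get(self) -> Optional[str]:
        now = self._clock()
        if self._cached is None or now - self._loaded_at >= self._ttl:
            self._cached = self._store.load()
            self._loaded_at = now
        return self._cached
```

A worker may serve a token that is up to one minute stale, which stays inside the 5-minute window during which the old access_token remains usable.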
# Appendix
# Appendix 1: List of Messages and Events Pushed by Weixin and Response Format
For details, see Message Push and Event Description.
# Appendix 2: Common Tools for Viewing Server Performance Load
The following are brief introductions to the common tools for viewing server performance load. For detailed usage of the tools, see respective documentation.
1. View the performance load of CPU
a) uptime
It is used to observe the overall load of the server. The system load is the average length of the run queue over the last 1, 5, and 15 minutes, and it should normally be less than the number of CPUs.
b) vmstat
vmstat (Virtual Memory Statistics) monitors the virtual memory, processes, and CPU activity of the operating system. It reports statistics on the overall condition of the system and is usually run as vmstat 5 5 (one report every 5 seconds, five times). The resulting summary reflects the actual condition of the system.
c) top
The top command is one of the most popular Unix/Linux performance tools. System administrators can run top to monitor the processes and the overall performance of Linux.
2. View the performance load of memory
a) free
The free command in Linux shows current system memory usage, including free and used physical memory and swap, shared memory, and the buffers used by the kernel.
3. View the performance load of network
a) netstat
Netstat is a console command and a very useful tool for monitoring TCP/IP networks. It displays route tables, actual network connections, and status information of each network interface device. Netstat is used to display statistics related to IP, TCP, UDP, and ICMP protocols. It is generally used to check the network connection of each port of the server.
b) sar
sar (System Activity Reporter) is one of the most comprehensive system performance analysis tools in Linux. It can report system activities from various aspects, including file read and write, system call usage, disk I/O, CPU efficiency, memory usage, process activity, and IPC related activities. This document mainly introduces the sar command used in the CentOS 6.3 x64 system.
4. View the performance load of disk
a) iostat
The iostat command in Linux can be used to report CPU statistics and input/output statistics of the entire system, adapters, tty devices, disks, and CD-ROM.
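The rule of thumb given for uptime (load average below the number of CPUs) is easy to automate. The following sketch reads the same three load averages through Python's os module on Unix-like systems; the function name and the window labels are illustrative.

```python
import os


def overloaded_windows(load_avg, cpu_count: int) -> list:
    """Return the averaging windows whose load is at or above the CPU count."""
    labels = ("1min", "5min", "15min")
    return [label for label, value in zip(labels, load_avg) if value >= cpu_count]


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    cpus = os.cpu_count() or 1
    print("CPUs:", cpus, "overloaded windows:", overloaded_windows(os.getloadavg(), cpus))
```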
# Appendix 3: nginx Configuration and Troubleshooting Guide
How to troubleshoot nginx issues
In case of an alarm of direct timeout or slow processing and response, you can troubleshoot on the nginx side as follows:
1. Check the request logs by running tail -f logs/access.log and view the upstream_status field.
200: indicates normal.
502/503/504: indicates slow processing or that the backend is down. Check whether upstream_response_time is unusually long, for example hundreds of milliseconds or more. If yes, there is a problem with the backend service.
404: indicates that the requested path does not exist or is incorrect, or the file does not exist. Check whether the URL path configured on the Official Accounts Platform is correct, and whether the file or program exists on the server.
403: indicates no access permission. Check whether nginx.conf contains special access configuration.
499: indicates a client issue, and you should contact the Weixin Team. This error is rare.
2. Check the error logs by running tail -f logs/error.log to see whether there are entries containing connect() failed, Connection refused, or Connection reset by peer. If yes, the number of connections may exceed what nginx is configured to handle.
(1) Check whether there are too many network connections in the system
    netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
Description of the states:
- CLOSED: No connection is active or pending
- LISTEN: The server is waiting for an incoming connection
- SYN_RECV: A connection request has arrived and is waiting to be acknowledged
- SYN_SENT: The application has started to open a connection
- ESTABLISHED: Normal data transfer state (current concurrent connections)
- FIN_WAIT1: The application has said it is finished
- FIN_WAIT2: The other side has agreed to release
- TIME_WAIT: Waiting for all packets to die off
- CLOSING: Both sides tried to close simultaneously
- CLOSE_WAIT: The other side has initiated a release
- LAST_ACK: Waiting for the final acknowledgment before closing
(2) Check whether the number of handles in the system (ulimit -n) is too small (less than the number of requests)
(3) Check whether worker_rlimit_nofile and worker_connections values are too small (less than the number of requests)