On the 28th of last month, the mobile messenger KakaoTalk went down for about four hours, the longest outage since the service launched in March 2010.

On the 30th, Kakao issued an official statement saying the service had been interrupted for roughly four hours because of a power-system problem at its data center (IDC), and stressed that the outage was caused neither by a power-supply issue brought on by traffic overload nor by a failure in its server farm.

On the surface, then, this outage looks like a data-center problem rather than KakaoTalk's own. But it was not KakaoTalk's first failure; the service has suffered about four server outages so far.

The services that gained from KakaoTalk's outage were NAVER LINE and TicToc, both of which are direct substitutes for KakaoTalk.

Anyone who has used all three services will have noticed that LINE and TicToc deliver messages faster than KakaoTalk.

They are all mobile messengers, so the difference in delivery speed comes partly from server capacity, but the application architecture also plays a part.

This post looks at how NAVER Japan (the developer of NAVER LINE) went about making LINE faster.

(LINE was chosen because NAVER Japan is the only messenger developer to have opened up its implementation to the public.)

It makes sense to start from LINE's beginnings.

LINE launched in the Japanese market in June of last year. It is a mobile messenger built by a project team led personally by NHN chairman Lee Hae-jin, and unlike NAVER Talk, which NHN was already operating, it focused purely on instant chat. NAVER Japan then added mobile VoIP (m-VoIP) in October 2011 and kept growing its user base.

LINE has now passed 30 million registered users worldwide and is closing in on KakaoTalk.

According to the NAVER Japan engineering blog (http://tech.naver.jp/blog/?p=1420), LINE is built on a NoSQL DBMS (database management system).

Unlike a relational DBMS, a NoSQL store is non-relational, which lets it handle very large data sets flexibly.

If a mobile messenger or social network service (SNS) is built on a relational DBMS, every update is checked for consistency and validity, which can create bottlenecks.

In a messenger, a bottleneck means the DBMS cannot keep up when a flood of messages is sent and received at once; in other words, the servers cannot handle the message volume.

This is why Twitter and Facebook adopted NoSQL early on: the service keeps running even when the read-to-write ratio for new data (posts) approaches 5:5.


Back to LINE: NAVER Japan originally built LINE's storage on Redis, one of the NoSQL stores.

NAVER Japan made full use of Redis's strengths, synchronous and asynchronous operation and slave replication, but underestimated its weakness: expanding its storage capacity is hard.

NAVER Japan initially expected LINE to attract fewer than one million users. When the user base passed five million within half a year, the team had to decide whether to expand the existing Redis cluster or overhaul the architecture.

In the process, NAVER Japan decided to adopt a new NoSQL store and narrowed the candidates to HBase, Cassandra, and MongoDB. The selection criteria were the three things that matter most for a mobile messenger: scalability, availability, and cost.

NAVER Japan chose HBase, which runs quickly on top of the Hadoop file system, and migrated to it. Cassandra beat the other two on data storage and availability, but HBase satisfied the overall requirements best.

LINE became faster after migrating from Redis to HBase: clusters can be shared and reads and writes are load-balanced, so the system copes even when a large amount of data arrives at once.

Solving the limits of an existing service with cloud infrastructure and open source, and sharing the process publicly, should have a positive influence on the rest of the industry.


[Original source: http://www.ddaily.co.kr/news/news_view.php?uid=90737]






LINE Storage: Storing billions of rows in Sharded-Redis and HBase per Month

by sunsuk7tp on 2012.4.26


Hi, I'm Shunsuke Nakamura (@sunsuk7tp). Just half a year ago, I completed the Computer Science Master's program at Tokyo Tech and joined NHN Japan as a member of the LINE server team. My ambition is to hack on distributed processing and storage systems and to develop the next generation's architecture.

In the LINE server team, I am in charge of developing and operating the advanced storage system that manages LINE's messages, contacts, and groups.

Today, I’ll briefly introduce the LINE storage stack.

LINE Beginning with Redis [2011.6 ~]

In the beginning, we adopted Redis as LINE's primary storage. LINE is aimed at instant messaging, where messages are exchanged quickly, and we had assumed it would reach at most 1 million registered users in 2011. Redis is an in-memory data store and does that job well. It also lets us take periodic snapshots to disk and supports synchronous/asynchronous master-slave replication. We decided that Redis was the best choice despite the scalability and availability issues that come with an in-memory store. The entire LINE storage system started with a single Redis cluster of just 3 nodes, sharded on the client side.
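As an illustration of the client-side sharding mentioned above, the snippet below is a minimal sketch, not LINE's actual code; it assumes the redis-py client and hypothetical node addresses, and maps each key to one of a fixed set of nodes by a stable hash.

# Minimal client-side sharding sketch over a fixed set of Redis nodes.
# Assumes the redis-py client; host names are hypothetical.
import zlib

import redis

NODES = [
    redis.Redis(host="redis01", port=6379),
    redis.Redis(host="redis02", port=6379),
    redis.Redis(host="redis03", port=6379),
]

def node_for(key: str) -> redis.Redis:
    # A stable hash of the key picks the shard. With a fixed node count this
    # is trivial, but adding a node remaps most keys.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

def save_profile(user_id: str, name: str) -> None:
    node_for(user_id).hset(f"user:{user_id}", mapping={"name": name})

def load_profile(user_id: str) -> dict:
    return node_for(user_id).hgetall(f"user:{user_id}")

The modulo placement is exactly what stops this from scaling gracefully: growing from 3 nodes to 4 moves roughly three quarters of the keys, which is why a managed, consistently hashed cluster (described next) became necessary.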

As the service grew, more nodes were needed, and client-side sharding prevented us from scaling effectively. Stock Redis still does not support server-side sharding, so we built our own clustering manager and now run sharded Redis clusters coordinated by the cluster manager daemons and a ZooKeeper quorum.

This manager has the following characteristics:

  • Sharding management by ZooKeeper (Consistent hashing, compatible with other algorithms)
  • Failure detection and auto/manual failover between master and slave
  • Scales out with minimal downtime (< 10 sec)

Currently, several sharded Redis clusters are running with hundreds of servers.
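The consistent-hashing placement mentioned in the list above can be illustrated with a small hash ring. This is a minimal sketch under assumed parameters (MD5 hashing, 100 virtual points per node), not the actual manager; the ZooKeeper coordination and failover logic are omitted.

# Minimal consistent-hash ring: each node owns many virtual points, so adding
# or removing a node only remaps the keys that fall in its own arcs.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # (hash, node) pairs sorted by hash form the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["redis01", "redis02", "redis03"])
print(ring.node_for("user:42"))  # one of the three node names

Because each physical node owns many small arcs of the ring, adding a node only takes over its own arcs instead of reshuffling almost every key, which is what keeps scale-out cheap enough to finish with minimal downtime.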

[Figure: Sharded Redis cluster and management tool]

Tolerating Unpredictable Scaling [2011.10 ~]

However, the situation has changed greatly since then. Around October 2011, LINE experienced tremendous growth in many parts of the world, and operating costs increased as well.

A major driver of the increased cost was scaling the Redis cluster's capacity. Operating a Redis cluster under unpredictable growth is harder than with persistent stores, because the in-memory design requires more servers for the same amount of data. To use availability features such as snapshots and full replication safely, memory usage has to be watched carefully. Redis VM (virtual memory) helps somewhat, but can significantly impair performance depending on how it is used.

For the above reasons, we often misjudged the timing to scale out and encountered some outages. It then became critical to migrate to a more highly scalable system with high availability.

Almost overnight, LINE's target changed to scaling to tens or even hundreds of millions of registered users.

This is how we tackled the problem.

Data Scalability

At first, we analyzed the order of magnitude for each database.

(n: # of Users)
(t: Lifetime of LINE System)

  • O(1)
    • Messages in delivery queue
    • Asynchronous jobs in job queue
  • O(n)
    • User Profile
    • Contacts / Groups
      • These data would grow as O(n^2) in the naive case, but the number of links between users is capped, so growth is effectively O(n * CONSTANT_SIZE).
  • O(n*t)
    • Messages in Inbox
    • Change-sets of User Profile / Groups / Contacts

Rows stored in LINE storage have increased exponentially. In the near future, we will deal with tens of billions of rows per month.

Data Requirement

Second, we summarized our data requirements for each usage scenario.

  • O(1)
    • Availability
    • Workload: very fast reads and writes
  • O(n)
    • Availability, Scalability
    • Workload: fast random reads
  • O(n*t)
    • Scalability, Massive volume (Billions of small rows per day, but mostly cold data)
    • Workload: fast sequential writes (append-only) and fast reads of the latest data

Choosing Storage

Finally, we chose a suitable store according to the requirements above. To configure each store and determine which best fits LINE's workloads, we benchmarked several candidates with YCSB (Yahoo! Cloud Serving Benchmark) and with our own benchmark simulating those workloads. As a result, we decided to use HBase as the primary store for data with exponential growth patterns such as the message timeline. HBase's characteristics suit the timeline's "latest" workload, in which the most recently inserted records sit at the head of the distribution.
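To make the "latest workload" point concrete, here is a sketch of a timeline-style row-key design on HBase. It is an illustration only, not LINE's actual schema: it assumes the happybase Thrift client, a hypothetical messages table with a single column family, and a reversed millisecond timestamp in the row key so that each user's newest messages sort first and a short prefix scan returns them.

# Timeline-style HBase row key sketch: user id prefix + reversed timestamp,
# so the most recent rows for a user come first in a prefix scan.
# Assumes happybase (HBase Thrift gateway); table and host are hypothetical.
import time

import happybase

MAX_TS = 10 ** 13  # larger than any millisecond timestamp we will see

def message_row_key(user_id: str, ts_millis: int) -> bytes:
    reverse_ts = MAX_TS - ts_millis
    return f"{user_id}:{reverse_ts:013d}".encode()

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("messages")  # column family "m" assumed

def append_message(user_id: str, body: str) -> None:
    ts_millis = int(time.time() * 1000)
    table.put(message_row_key(user_id, ts_millis), {b"m:body": body.encode()})

def latest_messages(user_id: str, limit: int = 20):
    # Rows for one user share the prefix; the newest appear first.
    return list(table.scan(row_prefix=f"{user_id}:".encode(), limit=limit))

New messages land near the head of each user's range, which matches the requirement above for fast append-only writes and fast reads of the most recent data.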

  • O(1)
    • Redis is the best choice.
  • O(n), O(n*t)
    • There are several candidates.
    • HBase
      • pros:
        • Best matches our requirements
        • Easy to operate (Storage system built on DFS, multiple ad hoc partitions per server)
      • cons:
        • Random read and deletion are somewhat slow.
        • Slightly lower availability (there are some SPOFs)
    • Cassandra (My favorite NoSQL)
      • pros:
        • Also suitable for dealing with the latest workload
        • High Availability (decentralized architecture, rack/DC-aware replication)
      • cons:
        • High operation costs due to weak consistency
        • Counter increments are expected to be slightly slower.
    • MongoDB
      • pros:
        • Auto sharding, auto failover
        • A rich range of operations (but LINE storage doesn’t require most of them.)
      • cons:
        • NOT suitable for the timeline workload (B-tree indexing)
        • Ineffective disk and network utilization

In summary, the LINE storage layer is currently constructed as follows:

  1. Standalone Redis: asynchronous job and message queuing
    • Redis queue and queue dispatcher run together on each application server (see the sketch after this list).
  2. Sharded Redis: front-end cache for data with O(n*t) and primary storage with O(n)
  3. Backup MySQL: secondary storage (for backup, statistics)
  4. HBase: primary storage for data with O(n*t)
    • We expect to operate hundreds of terabytes of data on each cluster, with 100s to 1,000 servers.
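As a sketch of item 1 above, a Redis list can act as the per-application-server job queue, with a dispatcher process blocking on it. This assumes the redis-py client; the queue name and job format are illustrative, not LINE's.

# Redis-backed asynchronous job queue plus a blocking dispatcher.
# Assumes redis-py; queue name and job payloads are illustrative.
import json

import redis

queue = redis.Redis(host="localhost", port=6379)
QUEUE_KEY = "jobs:async"

def enqueue(job_type: str, payload: dict) -> None:
    queue.lpush(QUEUE_KEY, json.dumps({"type": job_type, "payload": payload}))

def dispatch_forever() -> None:
    # BRPOP blocks until a job arrives, so the dispatcher idles cheaply.
    while True:
        _, raw = queue.brpop(QUEUE_KEY)
        job = json.loads(raw)
        print("running", job["type"], job["payload"])

enqueue("push_notification", {"user_id": 42})  # producer side (app server)
# dispatch_forever() would run in the dispatcher process alongside it.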

LINE's main storage currently comprises about 600 nodes, and that number continues to grow month after month.

[Figure: LINE storage stack]

Data Migration from Redis to HBase

We gradually migrated tens of terabytes of data from the Redis cluster to the HBase cluster. Specifically, the migration proceeded in three phases:

  1. Dual-write to both Redis and HBase; read only from Redis
  2. Run a migration script in the background (sequentially read data from Redis and write it to HBase)
  3. Write to both Redis (with TTL) and HBase (without TTL) and read from both (Redis now serves as a cache)

One thing to note is that newer data must not be overwritten with older data: the migrated data are mostly append-only, and the consistency of the rest is maintained using HBase column timestamps.
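The timestamp rule can be made concrete with a small sketch of the dual-write path in phases 1 and 3. It reuses the client assumptions from the earlier sketches (redis-py and happybase) and illustrative key names: writing HBase cells with an explicit timestamp means a backfill replaying older Redis data can never shadow a newer value, while the Redis copy carries a TTL once it serves only as a cache.

# Dual-write sketch for the migration: explicit HBase cell timestamps keep
# newer values on top, and the Redis copy gets a TTL once it is only a cache.
# redis_client / hbase_table follow the earlier redis-py / happybase sketches.
import time

CACHE_TTL_SECONDS = 7 * 24 * 3600  # hypothetical cache lifetime

def write_profile(redis_client, hbase_table, user_id: str, name: str) -> None:
    ts_millis = int(time.time() * 1000)
    # HBase serves the cell version with the largest timestamp, so a slow
    # backfill writing with older timestamps cannot overwrite this value.
    hbase_table.put(
        f"profile:{user_id}".encode(),
        {b"p:name": name.encode()},
        timestamp=ts_millis,
    )
    redis_client.set(f"profile:{user_id}", name, ex=CACHE_TTL_SECONDS)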

HBase and HDFS

A number of HBase clusters have been running on HDFS, stably for the most part. We built an HBase cluster for each database (e.g., messages, contacts), and each cluster is tuned to that database's workload. They share a single HDFS cluster of 100 servers, each with 32GB of memory and 1.5TB of disk. Each RegionServer holds about 50 small regions rather than a single 10GB region. Read performance in a Bigtable-like architecture suffers during (major) compaction, so regions are kept small enough to avoid continuous major compaction, especially during peak hours. During off-peak hours, a periodic cron job splits large regions into smaller ones, and operators rebalance the load manually. HBase does offer automatic splitting and load balancing, but given our service requirements we find it best to control these manually.

Given the service's growth, scalability remains one of the key issues. We plan to place at most hundreds of servers per cluster. Each message has a TTL, and messages are partitioned across multiple clusters in units of that TTL. This way, once every message in the oldest cluster has expired, that cluster can be fully truncated and reused as a new cluster.
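The TTL-based rotation can be summarized in a few lines: route each write to a cluster chosen by its time window, and once every message in the oldest window has expired, truncate that cluster and put it back into rotation. The window length and cluster names below are illustrative, not LINE's actual values.

# Route writes to a message cluster by time window. When the oldest window's
# messages have all expired (TTL), that cluster is truncated and reused.
import time

TTL_WINDOW_SECONDS = 30 * 24 * 3600            # hypothetical message TTL
CLUSTERS = ["msg-cluster-a", "msg-cluster-b"]  # rotated in turn

def cluster_for(ts_seconds: float) -> str:
    window = int(ts_seconds // TTL_WINDOW_SECONDS)
    return CLUSTERS[window % len(CLUSTERS)]

print(cluster_for(time.time()))  # cluster that receives current writes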

Current and future challenges [2012]

Since migrating to HBase, LINE storage has been operating more stably. Each HBase cluster currently handles several times as many requests as it did during the New Year peak. Even so, failures due to storage still occur occasionally. The following availability issues remain for the HBase and Redis clusters:

  • A redundant, rack/DC-aware configuration and failover mechanism with no single point of failure in any component
    • We are examining replication at various layers, such as full replication and SSTable- or block-level replication between HDFS clusters.
  • Compensation for failures between clusters (Redis cluster, HBase, and multi-HBase clusters)

HA-NameNode

As you may already know, the NameNode is a single point of failure for HDFS. Although the NameNode process itself rarely fails (note: experience at Yahoo!), other software or hardware failures, such as disk and network failures, are bound to occur. A NameNode failover procedure is therefore required to achieve high availability.

There are the several HA-NameNode configurations:

  • High Availability Framework for HDFS NameNode (HDFS-1623)
  • Backup NameNode (0.21)
  • Avatar NameNode (Facebook)
  • HA NameNode using Linux HA
  • Active/passive configuration deploying two NameNodes (Cloudera)

We configure HA-NameNode using Linux HA. Each component of Linux-HA has a role similar to the following:

  • DRBD: Disk mirroring
  • Heartbeat / (Corosync): Network fail-detector
  • Pacemaker: Failover definition
  • service: NameNode, Secondary NameNode, VIP
[Figure: HA NameNode using Linux-HA]

DRBD (Distributed Replicated Block Device) provides block-level replication; essentially, it is a network-enabled RAID-1 driver. Heartbeat monitors the link between the two servers. If Heartbeat detects a hardware or service outage, it switches the DRBD primary/secondary roles and starts each service daemon according to the logic defined in Pacemaker.

Conclusion

Thus far, we have faced various scalability and availability challenges as LINE has grown. However, given the extreme scale ahead and the many possible failure cases, our storage and strategies are still far from mature. We would like to grow along with the future growth of LINE.

Appendix: How to setup HA-NameNode using Linux-HA

In the rest of this entry, I will show how to build an HA-NameNode using two CentOS 5.4 servers and Linux-HA. The servers are assumed to have the following environment.

  • Hosts:
    1. NAMENODE01: 192.168.32.1 (bonding)
    2. NAMENODE02: 192.168.32.2 (bonding)
  • OS: CentOS 5.4
  • DRBD (v8.0.16):
    • conf file: ${DEPLOY_HOME}/ha/drbd.conf
    • resource name: drbd01
    • mount disk: /dev/sda3
    • mount device: /dev/drbd0
    • mount directory: /data/namenode
  • Heartbeat (v3.0.3):
    • conf file: ${DEPLOY_HOME}/ha/haresources, authkeys
  • Pacemaker (v1.0.12)
  • service daemons
    • VIP: 192.168.32.3
    • Hadoop NameNode, SecondaryNameNode (v1.0.2, the latest release at the time of writing)

Configuration

Configure the drbd and heartbeat settings in your deploy home directory, ${DEPLOY_HOME}.

  • drbd.conf
global { usage-count no; }
 
resource drbd01 {
  protocol  C;
  syncer { rate 100M; }
  startup { wfc-timeout 0; degr-wfc-timeout 120; }
 
  on NAMENODE01 {
    device /dev/drbd0;
    disk    /dev/sda3;
    address 192.168.32.1:7791;
    meta-disk   internal;
  }
  on NAMENODE02 {
    device /dev/drbd0;
    disk    /dev/sda3;
    address 192.168.32.2:7791;
    meta-disk   internal;
  }
}
  • ha.cf
debugfile ${HOME}/logs/ha/ha-debug
logfile ${HOME}/logs/ha/ha-log
logfacility local0
pacemaker on
keepalive 1
deadtime 5
initdead 60
udpport 694
auto_failback off
node    NAMENODE01 NAMENODE02
  • haresources (Can skip this step when using pacemaker)
# <primary hostname> <vip> <drbd> <local fs path> <running daemon name>
NAMENODE01 IPaddr::192.168.32.3 drbddisk::drbd0 Filesystem::/dev/drbd0::/data/namenode::ext3::defaults hadoop-1.0.2-namenode
  • authkeys
auth 1
1 sha1 hadoop-namenode-cluster

Installation of Linux-HA

Pacemaker and Heartbeat 3.0 packages are not included in the default base and updates repositories in CentOS 5. Before installation, you first need to add the Cluster Labs repo:

wget -O /etc/yum.repos.d/clusterlabs.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo

Then run the following script:

yum install -y drbd kmod-drbd heartbeat pacemaker
 
# logs
mkdir -p ${HOME}/logs/ha
mkdir -p ${HOME}/data/pids/hadoop
 
# drbd
cd ${DRBD_HOME}
ln -sf ${DEPLOY_HOME}/drbd/drbd.conf drbd.conf
echo "/dev/drbd0 /data/namenode ext3 defaults,noauto 0 0" >> /etc/fstab
yes | drbdadm create-md drbd01
 
# heartbeat
cd ${HA_HOME}
ln -sf ${DEPLOY_HOME}/ha/ha.cf ha.cf
ln -sf ${DEPLOY_HOME}/ha/haresources haresources
cp ${DEPLOY_HOME}/ha/authkeys authkeys
chmod 600 authkeys
 
chown -R www.www ${HOME}/logs
chown -R www.www ${HOME}/data
chown -R www.www /data/namenode
 
chkconfig --add heartbeat
chkconfig heartbeat on

DRBD Initialization and Running heartbeat

  1. Run the drbd service on both the primary and secondary:
    # service drbd start
  2. Initialize DRBD and format the NameNode on the primary:
    # drbdadm -- --overwrite-data-of-peer primary drbd01
    # mkfs.ext3 /dev/drbd0
    # mount /dev/drbd0
    $ hadoop namenode -format
    # umount /dev/drbd0
    # service drbd stop
  3. Run heartbeat on both the primary and secondary:
    # service heartbeat start

Daemonize hadoop processes (Apache Hadoop)

When using Apache Hadoop, each service such as the NameNode and SecondaryNameNode must be wrapped in an init script so that the heartbeat process can start and stop it. The following script, "hadoop-1.0.2-namenode", is an example for the NameNode daemon.

  • /etc/init.d/hadoop-1.0.2-namenode
#!/bin/sh
# Init script for a single Hadoop daemon. The Hadoop release and the service
# name are derived from this script's file name (e.g. "hadoop-1.0.2-namenode").

BASENAME=$(basename $0)
HADOOP_RELEASE=$(echo $BASENAME | awk '{n = split($0, a, "-"); s = a[1]; for(i = 2; i < n; ++i) s = s "-" a[i]; print s}')
SVNAME=$(echo $BASENAME | awk '{n = split($0, a, "-"); print a[n]}')

DAEMON_CMD=/usr/local/${HADOOP_RELEASE}/bin/hadoop-daemon.sh
[ -f $DAEMON_CMD ] || exit 1

start() {
    # The daemon is assumed to run as the www user, matching the
    # hadoop-www-*.pid file names expected by the OCF resource agent below.
    su www -c "${DAEMON_CMD} start ${SVNAME}"
    RETVAL=$?
}

stop() {
    su www -c "${DAEMON_CMD} stop ${SVNAME}"
    RETVAL=$?
}

RETVAL=0
case "$1" in
    start)
        start
        ;;

    stop)
        stop
        ;;

    restart)
        stop
        sleep 2
        start
        ;;

    *)
        echo "Usage: ${HADOOP_RELEASE}-${SVNAME} {start|stop|restart}"
        exit 1
    ;;
esac
exit $RETVAL

Second, place a resource agent script that Pacemaker uses to control these daemons. Pacemaker (OCF) resource agents live under /usr/lib/ocf/resource.d/.

  • /usr/lib/ocf/resource.d/nhnjp/Hadoop
#!/bin/bash
#
# Resource script for Hadoop service
#
# Description:  Manages a Hadoop service as an OCF resource in
#               a High Availability setup.
#
#
#   usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
#   The "start" arg starts Hadoop service.
#
#   The "stop" arg stops it.
#
# OCF parameters:
# OCF_RESKEY_hadoopversion
# OCF_RESKEY_hadoopsvname
#
# Note:This RA uses 'jps' command to identify Hadoop process
##########################################################################
# Initialization:
 
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
 
USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";
 
##########################################################################
 
usage()
{
    echo $USAGE >&2
}
 
meta_data()
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="Hadoop">
<version>1.0</version>
<longdesc lang="en">
This script manages Hadoop service.
</longdesc>
<shortdesc lang="en">Manages an Hadoop service.</shortdesc>
 
<parameters>
 
<parameter name="hadoopversion">
<longdesc lang="en">
Hadoop version identifier: hadoop-[version]
For example, "1.0.2" or "0.20.2-cdh3u3"
</longdesc>
<shortdesc lang="en">hadoop version string</shortdesc>
<content type="string" default="1.0.2"/>
</parameter>
 
<parameter name="hadoopsvname">
<longdesc lang="en">
Hadoop service name.
One of namenode|secondarynamenode|datanode|jobtracker|tasktracker
</longdesc>
<shortdesc lang="en">hadoop service name</shortdesc>
<content type="string" default="none"/>
</parameter>
 
</parameters>
 
<actions>
<action name="start" timeout="20s"/>
<action name="stop" timeout="20s"/>
<action name="monitor" depth="0" timeout="10s" interval="5s" />
<action name="validate-all" timeout="5s"/>
<action name="meta-data"  timeout="5s"/>
</actions>
</resource-agent>
END
exit $OCF_SUCCESS
}
 
HADOOP_VERSION="hadoop-${OCF_RESKEY_hadoopversion}"
HADOOP_HOME="/usr/local/${HADOOP_VERSION}"
[ -f "${HADOOP_HOME}/conf/hadoop-env.sh" ] && . "${HADOOP_HOME}/conf/hadoop-env.sh"
 
HADOOP_SERVICE_NAME="${OCF_RESKEY_hadoopsvname}"
HADOOP_PID_FILE="${HADOOP_PID_DIR}/hadoop-www-${HADOOP_SERVICE_NAME}.pid"
 
trace()
{
    ocf_log $@
    timestamp=$(date "+%Y-%m-%d %H:%M:%S")
    echo "$timestamp ${HADOOP_VERSION}-${HADOOP_SERVICE_NAME} $@" >> /dev/null
}
 
Hadoop_status()
{
    trace "Hadoop_status()"
    if [ -n "${HADOOP_PID_FILE}" -a -f "${HADOOP_PID_FILE}" ]; then
        # Hadoop is probably running
        HADOOP_PID=`cat "${HADOOP_PID_FILE}"`
        if [ -n "$HADOOP_PID" ]; then
            if ps f -p $HADOOP_PID | grep -qwi "${HADOOP_SERVICE_NAME}" ; then
                trace info "Hadoop ${HADOOP_SERVICE_NAME} running"
                return $OCF_SUCCESS
            else
                trace info "Hadoop ${HADOOP_SERVICE_NAME} is not running but pid file exists"
                return $OCF_NOT_RUNNING
            fi
        else
            trace err "PID file empty!"
            return $OCF_ERR_GENERIC
        fi
    fi
 
    # Hadoop is not running
    trace info "Hadoop ${HADOOP_SERVICE_NAME} is not running"
    return $OCF_NOT_RUNNING
}
 
Hadoop_start()
{
    trace "Hadoop_start()"
    # if Hadoop is running return success
    Hadoop_status
    retVal=$?
    if [ $retVal -eq $OCF_SUCCESS ]; then
        exit $OCF_SUCCESS
    elif [ $retVal -ne $OCF_NOT_RUNNING ]; then
        trace err "Error. Unknown status."
        exit $OCF_ERR_GENERIC
    fi
 
    service ${HADOOP_VERSION}-${HADOOP_SERVICE_NAME} start
    if [ $? -ne 0 ]; then
        trace err "Error. Hadoop ${HADOOP_SERVICE_NAME} returned error $?."
        exit $OCF_ERR_GENERIC
    fi
 
    trace info "Started Hadoop ${HADOOP_SERVICE_NAME}."
    exit $OCF_SUCCESS
}
 
Hadoop_stop()
{
    trace "Hadoop_stop()"
    if Hadoop_status ; then
        HADOOP_PID=`cat "${HADOOP_PID_FILE}"`
        if [ -n "$HADOOP_PID" ] ; then
            kill $HADOOP_PID
            if [ $? -ne 0 ]; then
                kill -s KILL $HADOOP_PID
                if [ $? -ne 0 ]; then
                    trace err "Error. Could not stop Hadoop ${HADOOP_SERVICE_NAME}."
                    return $OCF_ERR_GENERIC
                fi
            fi
            rm -f "${HADOOP_PID_FILE}" 2>/dev/null
        fi
    fi
    trace info "Stopped Hadoop ${HADOOP_SERVICE_NAME}."
    exit $OCF_SUCCESS
}
 
Hadoop_monitor()
{
    trace "Hadoop_monitor()"
    Hadoop_status
}
 
Hadoop_validate_all()
{
    trace "Hadoop_validate_all()"
    if [ -z "${OCF_RESKEY_hadoopversion}" ] || [ "${OCF_RESKEY_hadoopversion}" == "none" ]; then
        trace err "Invalid hadoop version: ${OCF_RESKEY_hadoopversion}"
        exit $OCF_ERR_ARGS
    fi
 
    if [ -z "${OCF_RESKEY_hadoopsvname}" ] || [ "${OCF_RESKEY_hadoopsvname}" == "none" ]; then
        trace err "Invalid hadoop service name: ${OCF_RESKEY_hadoopsvname}"
        exit $OCF_ERR_ARGS
    fi
 
    HADOOP_INIT_SCRIPT=/etc/init.d/${HADOOP_VERSION}-${HADOOP_SERVICE_NAME}
    if [ ! -d "${HADOOP_HOME}" ] || [ ! -x ${HADOOP_INIT_SCRIPT} ]; then
        trace err "Cannot find ${HADOOP_VERSION}-${HADOOP_SERVICE_NAME}"
        exit $OCF_ERR_ARGS
    fi
 
    if [ ! -L ${HADOOP_HOME}/conf ] || [ ! -f "$(readlink ${HADOOP_HOME}/conf)/hadoop-env.sh" ]; then
        trace err "${HADOOP_VERSION} isn't configured yet"
        exit $OCF_ERR_ARGS
    fi
 
    # TODO: do more strict checking
 
    return $OCF_SUCCESS
}
 
if [ $# -ne 1 ]; then
    usage
    exit $OCF_ERR_ARGS
fi
 
case $1 in
    start)
        Hadoop_start
        ;;
 
    stop)
        Hadoop_stop
        ;;
 
    status)
        Hadoop_status
        ;;
 
    monitor)
        Hadoop_monitor
        ;;
 
    validate-all)
        Hadoop_validate_all
        ;;
 
    meta-data)
        meta_data
        ;;
 
    usage)
        usage
        exit $OCF_SUCCESS
        ;;
 
    *)
        usage
        exit $OCF_ERR_UNIMPLEMENTED
        ;;
esac

Pacemaker settings

First, using the crm_mon command, verify whether the heartbeat process is running.

# crm_mon
Last updated: Thu Mar 29 17:32:36 2012
Stack: Heartbeat
Current DC: NAMENODE01 (bc16bea6-bed0-4b22-be37-d1d9d4c4c213)-partition with quorum
Version: 1.0.12
2 Nodes configured, unknown expected votes
0 Resources configured.
============
 
Online: [ NAMENODE01 NAMENODE02 ]

After verifying that the process is running, connect to Pacemaker using the crm command and configure its resource settings. (This step replaces the haresources configuration.)

crm(live)# configure
INFO: building help index
crm(live)configure# show
node $id="bc16bea6-bed0-4b22-be37-d1d9d4c4c213" NAMENODE01
node $id="25884ee1-3ce4-40c1-bdc9-c2ddc9185771" NAMENODE02
property $id="cib-bootstrap-options" \
        dc-version="1.0.12" \
        cluster-infrastructure="Heartbeat"
 
# with a two-NameNode (two-node) cluster, the following setting is needed
crm(live)configure# property $id="cib-bootstrap-options" no-quorum-policy="ignore"
 
# vip setting
crm(live)configure# primitive ip_namenode ocf:heartbeat:IPaddr \
params ip="192.168.32.3"
 
# drbd setting
crm(live)configure# primitive drbd_namenode ocf:heartbeat:drbd \
        params drbd_resource="drbd01" \
        op start interval="0s" timeout="10s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
# drbd master/slave setting
crm(live)configure# ms ms_drbd_namenode drbd_namenode meta master-max="1" \
master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
 
# fs mount setting
crm(live)configure# primitive fs_namenode ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/data/namenode" fstype="ext3"
 
# service daemon setting
primitive namenode ocf:nhnjp:Hadoop \
        params hadoopversion="1.0.2" hadoopsvname="namenode" \
        op monitor interval="5s" timeout="60s" on-fail="standby"
primitive secondarynamenode ocf:nhnjp:Hadoop \
        params hadoopversion="1.0.2" hadoopsvname="secondarynamenode" \
        op monitor interval="30s" timeout="60s" on-fail="restart"

Here, ocf:${GROUP}:${SERVICE} corresponds to /usr/lib/ocf/resource.d/${GROUP}/${SERVICE}, so you should place your custom resource agent script there. Likewise, lsb:${SERVICE} corresponds to /etc/init.d/${SERVICE}.

Finally, you can confirm Pacemaker's settings using the show command.

crm(live)configure# show
node $id="bc16bea6-bed0-4b22-be37-d1d9d4c4c213" NAMENODE01
node $id="25884ee1-3ce4-40c1-bdc9-c2ddc9185771" NAMENODE02
primitive drbd_namenode ocf:heartbeat:drbd \
        params drbd_resource="drbd01" \
        op start interval="0s" timeout="10s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
primitive fs_namenode ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/data/namenode" fstype="ext3"
primitive ip_namenode ocf:heartbeat:IPaddr \
        params ip="192.168.32.3"
primitive namenode ocf:nhnjp:Hadoop \
        params hadoopversion="1.0.2" hadoopsvname="namenode" \
        meta target-role="Started" \
        op monitor interval="5s" timeout="60s" on-fail="standby"
primitive secondarynamenode ocf:nhnjp:Hadoop \
        params hadoopversion="1.0.2" hadoopsvname="secondarynamenode" \
        meta target-role="Started" \
        op monitor interval="30s" timeout="60s" on-fail="restart"
group namenode-group fs_namenode ip_namenode namenode secondarynamenode
ms ms_drbd_namenode drbd_namenode \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" globally-unique="false"
colocation namenode-group_on_drbd inf: namenode-group ms_drbd_namenode:Master
order namenode_after_drbd inf: ms_drbd_namenode:promote namenode-group:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.12" \
        cluster-infrastructure="Heartbeat" \
        no-quorum-policy="ignore" \
        stonith-enabled="false"

Once you’ve confirmed the configuration is correct, commit it using the commit command.

crm(live)configure# commit

Once you have run the commit command, Heartbeat starts each service according to Pacemaker's rules.
You can monitor whether nodes and resources are alive using the crm_mon command.

$ crm_mon -A
 
============
Last updated: Tue Apr 10 12:40:11 2012
Stack: Heartbeat
Current DC: NAMENODE01 (bc16bea6-bed0-4b22-be37-d1d9d4c4c213)-partition with quorum
Version: 1.0.12
2 Nodes configured, unknown expected votes
2 Resources configured.
============
 
Online: [ NAMENODE01 NAMENODE02 ]
 
 Master/Slave Set: ms_drbd_namenode
     Masters: [ NAMENODE01 ]
     Slaves: [ NAMENODE02 ]
 Resource Group: namenode-group
     fs_namenode        (ocf::heartbeat:Filesystem):    Started NAMENODE01
     ip_namenode        (ocf::heartbeat:IPaddr):        Started NAMENODE01
     namenode   (ocf::nhnjp:Hadoop):    Started NAMENODE01
     secondarynamenode  (ocf::nhnjp:Hadoop):    Started NAMENODE01
 
Node Attributes:
* Node NAMENODE01:
    + master-drbd_namenode:0            : 75
* Node NAMENODE02:
    + master-drbd_namenode:1            : 75

Finally, you should run various failover tests, for example by killing each service daemon and by simulating network failures with iptables.



Source: http://tech.naver.jp/blog/?p=1420



Posted by linuxism