Open Source Developer's Hub: May 2011

Tuesday, May 17, 2011

MySQL Split String Function

MySQL does not include a function to split a delimited string. However, it’s very easy to create your own function.

Create function syntax

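MySQL offers two ways to add a function. A user-defined function (UDF) is a way to extend MySQL with a new function that works like a native MySQL function. It is compiled into a shared library and registered with:

CREATE [AGGREGATE] FUNCTION function_name
RETURNS {STRING|INTEGER|REAL|DECIMAL}
SONAME shared_library_name

To register a UDF, you must have the INSERT privilege for the mysql database.

A stored function, by contrast, is written in SQL and created with a CREATE FUNCTION statement that carries a parameter list and a body. Creating one requires the CREATE ROUTINE privilege. The split function below is a stored function.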

Split delimited strings

The following example function takes three parameters (the string, the delimiter, and a 1-based position) and returns the field at that position, using MySQL's built-in string functions.

Function

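-- Take the first pos fields, skip past the first pos - 1 fields,
-- then strip the leading delimiter that remains.
-- CHAR_LENGTH counts characters (LENGTH counts bytes), which matches
-- the character positions used by SUBSTRING.
CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
DETERMINISTIC
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       CHAR_LENGTH(SUBSTRING_INDEX(x, delim, pos - 1)) + 1),
       delim, '');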

Usage

SELECT SPLIT_STR(string, delimiter, position)

Example

SELECT SPLIT_STR('a|bb|ccc|dd', '|', 3) as third;

+-------+
| third |
+-------+
| ccc   |
+-------+
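
To check what the function should return before creating it, the same behaviour can be sketched in Python. This is only an illustration: the name split_str is hypothetical, and returning an empty string for an out-of-range position or an empty delimiter is an assumption based on tracing the SQL expression above for positive positions.

```python
def split_str(x: str, delim: str, pos: int) -> str:
    """Return the pos-th (1-based) field of x, or '' if there is no such field."""
    if not delim:
        # Assumption: an empty delimiter yields no fields at all.
        return ""
    parts = x.split(delim)
    if 1 <= pos <= len(parts):
        return parts[pos - 1]
    return ""


print(split_str("a|bb|ccc|dd", "|", 3))  # ccc
```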

Tuesday, May 10, 2011

Deploying Cassandra across Multiple Data Centers

Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data center deployment. These features are robust and flexible enough that you can configure the cluster for optimal geographical distribution, for redundancy for failover and disaster recovery, or even for creating a dedicated analytics center replicated from your main data storage centers.

Settings central to multi-data center deployment include:

Replication Factor and Replica Placement Strategy – NetworkTopologyStrategy (the default placement strategy) has capabilities for fine-grained adjustment of the number and location of replicas at the data center and rack level.

Snitch – For multi-data center deployments, it is important to make sure the snitch has complete and accurate information about the network, either by automatic detection (RackInferringSnitch) or details specified in a properties file (PropertyFileSnitch).

Consistency Level – Cassandra provides consistency levels that are specifically designed for scenarios with multiple data centers: LOCAL_QUORUM and EACH_QUORUM. Here, “local” means local to a single data center, while “each” means consistency is strictly maintained at the same level in each data center.
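
To make the snitch item above more concrete: RackInferringSnitch derives the data center from the second octet of a node's IP address and the rack from the third. The Python sketch below imitates that rule. The function name infer_location is hypothetical, and the two addresses are borrowed from the reader comment further down purely as sample input.

```python
def infer_location(ip: str) -> tuple:
    """Imitate RackInferringSnitch: 2nd octet -> data center, 3rd octet -> rack."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return octets[1], octets[2]


for ip in ("10.28.7.209", "10.40.7.206"):
    dc, rack = infer_location(ip)
    print(f"{ip}: data center {dc}, rack {rack}")
```

If the addressing scheme does not follow this pattern, a properties file is the safer way to describe the topology.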

Putting it all Together

Your specific needs will determine how you combine these ingredients in a “recipe” for multi-data center operations. For instance, an organization whose chief aim is to minimize network latency across two large service regions might end up with a relatively simple recipe for two data centers like the following:

Replica Placement Strategy: NetworkTopologyStrategy (NTS)

Replication Factor: 3 for each data center, as determined by the following strategy_options settings in cassandra.yaml:

strategy_options:
  DC1 : 3
  DC2 : 3

Snitch: RackInferringSnitch. Administrators configure the network topology of the two data centers in such a way that Cassandra can accurately extrapolate the details automatically with RackInferringSnitch.

Write Consistency Level: LOCAL_QUORUM

Read Consistency Level: LOCAL_QUORUM

For all applications that write to and read from Cassandra, the default consistency level for both reads and writes is LOCAL_QUORUM. This provides a reasonable level of data consistency while avoiding inter-data center latency.
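
The arithmetic behind this recipe can be sketched in a few lines of Python. A quorum is a majority of replicas, that is floor(RF / 2) + 1. The helper name quorum and the dictionary layout are assumptions made for illustration; they are not a Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# Replicas per data center, as in the recipe above.
strategy_options = {"DC1": 3, "DC2": 3}

# LOCAL_QUORUM: a quorum of the replicas in the coordinator's own data center.
local_quorum = quorum(strategy_options["DC1"])

# EACH_QUORUM: a quorum in every data center.
each_quorum = {dc: quorum(rf) for dc, rf in strategy_options.items()}

# QUORUM: a quorum of all replicas, regardless of data center.
global_quorum = quorum(sum(strategy_options.values()))

print(local_quorum, each_quorum, global_quorum)  # 2 {'DC1': 2, 'DC2': 2} 4
```

With three replicas per data center, a LOCAL_QUORUM request waits for two local replicas, so it can still succeed when one local replica is down.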

Visualizing It

In the following depiction of a write operation across our two hypothetical data centers, the darker grey nodes are the nodes that contain the token range for the data being written.

[Figure: Apache Cassandra multi-datacenter replication]

Note that LOCAL_QUORUM consistency allows the write operation to the second data center to be asynchronous. This way, the operation can be marked successful in the first data center – the data center local to the origin of the write – and Cassandra can serve read operations on that data without any delay from inter-data center latency.

Learning More about It

For more detail and more descriptions of multiple-data center deployments, see Multiple Data Centers in the DataStax reference documentation. And make sure to check this blog regularly for news related to the latest progress in multi-DC features, analytics, and other exciting areas of Cassandra development.

3 Responses to “Deploying Cassandra across Multiple Data Centers”

Sudesh says:

March 23, 2011 at 10:08 pm

I have 4 node cluster setup across 2 DC. DC1 contains 10.28.7.209 and 10.28.7.218. DC2 contains 10.40.7.206 and 10.40.7.191.

I have keyspace with NTS and strategy_options= DC1: 2 , DC2:2.

I have snitch as PropertyFileSnitch.

Topology.properties has definitions :
10.28.7.209=DC1:RAC1
10.28.7.218=DC1:RAC2
10.40.7.206=DC2:RAC1
10.40.7.191=DC2:RAC2
default=DC1:r1

Tokens are balanced as per:
bin/nodetool -h 10.40.7.206 ring
Address      Status  State   Load      Owns    Token
                                               131291297286969809762394636651102920798
10.40.7.191  Up      Normal  11.01 GB  25.02%  3713143261524536428796724100157456993
10.28.7.218  Up      Normal  21.79 GB  25.00%  46247024543280544328625128736684811735
10.40.7.206  Up      Normal  11.01 GB  24.97%  88731726328828585514925390034962410388
10.28.7.209  Up      Normal  22.19 GB  25.01%  131291297286969809762394636651102920798

With randompartitioner I inserted 1M records, and was expecting 1M rows on each node. I ended up with 1M each on DC1:RAC1 and DC2:RAC2, while ended up with just 75K each on DC2:RAC1 and DC2:RAC2.

I copied same topology.properties files on all 4 nodes.

Any clues??

Sudesh says:

March 23, 2011 at 10:10 pm

Sorry typo …
With randompartitioner I inserted 1M records, and was expecting 1M rows on each node. I ended up with 1M each on DC1:RAC1 and DC1:RAC2, while ended up with just 75K each on DC2:RAC1 and DC2:RAC2.

Eric says:

March 25, 2011 at 1:57 pm

Sudesh, from my own testing and other sources I depend on as a technical writer, I have not been able to get a definite answer to your question. The initial tokens and other configuration details look fine, and there is no clear reason for your load to fall out of balance.

I strongly suggest posting the same question on the Cassandra user list and letting the collective mind of the community explore the issue.

Tuesday, May 3, 2011

Unix Crontab Configuration

Introduction

cron is a utility that you can use to schedule and automate tasks. By defining items in the cron table, called crontab, you can schedule any script or program to run on almost any sort of schedule.

For example, you can run a program five minutes after midnight on Mondays, Wednesdays and Fridays, or schedule something to run every five minutes or once a month.

Basics

Each user has their own crontab, and the scheduled scripts run as that user, so take this into account with regard to permissions. To edit the crontab, use the following command:
$ crontab -e

You can list your current crontab using the following command:
$ crontab -l

Crontab Format
Entries in a crontab must use the following format. Lines starting with # are comments and are ignored.

# MIN   HOUR   MDAY  MON  DOW   COMMAND 
   5     *      *     *    *    echo 'Hello' 

Item     Definition            Valid Values
MIN      Minute                0-59
HOUR     Hour [24-hour clock]  0-23
MDAY     Day of Month          1-31
MON      Month                 1-12 OR jan,feb,mar,apr …
DOW      Day of Week           0-6 OR sun,mon,tue,wed,thu,fri,sat
COMMAND  Command to be run     Any valid command-line
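
As a rough illustration of the format above, this Python sketch splits a crontab line into its five schedule fields and the command. The function name parse_crontab_line is hypothetical, and the sketch deliberately ignores environment assignments such as MAILTO and the @-style special entries.

```python
from typing import Optional

FIELD_NAMES = ("MIN", "HOUR", "MDAY", "MON", "DOW")


def parse_crontab_line(line: str) -> Optional[dict]:
    """Return the five fields plus COMMAND, or None for blank lines and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Split on whitespace at most five times: five fields, then the command.
    parts = stripped.split(None, 5)
    if len(parts) < 6:
        raise ValueError(f"expected 5 fields and a command: {line!r}")
    entry = dict(zip(FIELD_NAMES, parts[:5]))
    entry["COMMAND"] = parts[5]
    return entry


print(parse_crontab_line("5 * * * * echo 'Hello'"))
```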

Examples
Here are a few examples to show what some entries look like.

# Run command at 7:00am each weekday [mon-fri]
00 07 * * 1-5 mail_pager.script 'Wake Up'

# Run command on 1st of each month, at 5:30pm
30 17 1 * * pay_rent.script

# Run command at 8:00am, 10:00am and 2:00pm every day
00 8,10,14 * * * do_something.script

# Run command every 5 minutes during market hours
*/5 6-13 * * mon-fri get_stock_quote.script

# Run command every 3 hours while awake
0 7-23/3 * * * drink_water.script

Special Characters in Crontab
You can use an asterisk in any field to mean every value, such as every day or every month.
You can use commas in any field to specify multiple values. For example: mon,wed,fri
You can use dashes to specify ranges. For example: mon-fri, or 9-17
You can use a forward slash to specify a step within a range. For example: */5 means every five minutes, hours, or days, depending on the field.
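
The four special characters above can be sketched as a small Python function that expands one field into the values it matches. This is a simplified illustration, not how any particular cron implementation works: the name expand_field is hypothetical, and it handles numeric values only, so names such as mon-fri are not supported.

```python
def expand_field(field: str, low: int, high: int) -> list:
    """Expand one numeric cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):            # commas: multiple values
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1   # slash: step
        if step < 1:
            raise ValueError(f"step must be positive: {part!r}")
        if rng == "*":                       # asterisk: the whole range
            start, end = low, high
        elif "-" in rng:                     # dash: an explicit range
            first, last = rng.split("-", 1)
            start, end = int(first), int(last)
        else:                                # a single value
            start = end = int(rng)
        if not (low <= start <= end <= high):
            raise ValueError(f"out of range: {part!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("7-23/3", 0, 23))   # [7, 10, 13, 16, 19, 22]
print(expand_field("8,10,14", 0, 23))  # [8, 10, 14]
```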

Special Entries
There are several special entries, some of which are just shortcuts, that you can use instead of specifying the full cron entry.
The most useful of these is probably @reboot, which allows you to run a command each time the computer is rebooted. This could be useful if you want to start up a server or daemon under a particular user, or if you do not have access to the rc.d/init.d files.
Example Usage:
# restart freevo servers
@reboot freevo webserver start
@reboot freevo recordserver start
The complete list:

Entry      Description            Equivalent To
@reboot    Run once, at startup.  None
@yearly    Run once a year        0 0 1 1 *
@annually  (same as @yearly)      0 0 1 1 *
@monthly   Run once a month       0 0 1 * *
@weekly    Run once a week        0 0 * * 0
@daily     Run once a day         0 0 * * *
@midnight  (same as @daily)       0 0 * * *
@hourly    Run once an hour       0 * * * *
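
The table above translates directly into a lookup. The Python sketch below rewrites a shortcut entry as its five-field equivalent; the function name expand_shortcut and the script path in the example are hypothetical.

```python
SHORTCUTS = {
    "@yearly": "0 0 1 1 *",
    "@annually": "0 0 1 1 *",
    "@monthly": "0 0 1 * *",
    "@weekly": "0 0 * * 0",
    "@daily": "0 0 * * *",
    "@midnight": "0 0 * * *",
    "@hourly": "0 * * * *",
}


def expand_shortcut(line: str) -> str:
    """Rewrite '@daily cmd' as '0 0 * * * cmd'; return other lines unchanged."""
    parts = line.split(None, 1)
    if parts and parts[0] in SHORTCUTS:
        command = parts[1].strip() if len(parts) > 1 else ""
        return f"{SHORTCUTS[parts[0]]} {command}".rstrip()
    # Includes @reboot, which has no time-based equivalent.
    return line


print(expand_shortcut("@daily /usr/local/bin/backup.sh"))
```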

Miscellaneous Issues
Script Output
If there is any output from your script or command, it is sent to that user’s e-mail account on that box, using the default mailer, which must be set up properly.
You can set the variable MAILTO in the crontab to specify a different e-mail address to use. For example:

MAILTO="admin@mydomain.com"

Redirect Output to /dev/null
You can redirect the output from a cron script to /dev/null, which just throws it away. By redirecting to /dev/null you will not receive anything from the script, even if it is throwing errors.

* * * * * /script/every_minute.pl > /dev/null 2>&1

Missed Schedule Time
Cron does not run a command if it was missed. Your computer must be running for cron to run the job at the time it is scheduled. For example, if you have a 1:00am scheduled job and your computer was off at that time, it will not run the missed job in the morning when you turn it on.



CRON Examples
The following are a few CRON examples, and how to set them up in both interfaces.

Example 1: Every 5 Minutes
Run every 5 minutes.

SIMPLE INTERFACE
Minute   Hour      Day  Month  Day of Week
Every 5  All       All  All    All

ADVANCED INTERFACE
Minute   Hour      Day  Month  Day of Week
*/5      *         *    *      *

Example 2: Yearly
Run yearly (at exactly midnight on January 1st).

SIMPLE INTERFACE
Minute   Hour      Day  Month  Day of Week
0        Midnight  1    Jan    All

ADVANCED INTERFACE
Minute   Hour      Day  Month  Day of Week
0        0         1    1      *

Example 3: Monthly
Run monthly (at 2:15AM on the 5th of each month).

SIMPLE INTERFACE
Minute   Hour      Day  Month  Day of Week
15       2AM       5    All    All

ADVANCED INTERFACE
Minute   Hour      Day  Month  Day of Week
15       2         5    *      *

Example 4: Weekly
Run weekly (at 4:32PM on every Thursday).

SIMPLE INTERFACE
Minute   Hour      Day  Month  Day of Week
32       4PM       All  All    Thu

ADVANCED INTERFACE
Minute   Hour      Day  Month  Day of Week
32       16        *    *      4

Example 5: Daily
Run daily (at 12:45AM every day).

SIMPLE INTERFACE
Minute   Hour      Day  Month  Day of Week
45       Midnight  All  All    All

ADVANCED INTERFACE
Minute   Hour      Day  Month  Day of Week
45       0         *    *      *

Example 6: Hourly
Run hourly (at 24 minutes past the hour).

SIMPLE INTERFACE
Minute     Hour      Day  Month               Day of Week
24         All       All  All                 All

ADVANCED INTERFACE
Minute     Hour      Day  Month               Day of Week
24         *         *    *                   *

Example 7: Complex 1
Run 52 minutes after the hour every 4 hours (e.g. 12:52AM, 4:52AM, 8:52AM, etc.).

SIMPLE INTERFACE
Minute     Hour      Day  Month               Day of Week
52         Every 4   All  All                 All

ADVANCED INTERFACE
Minute     Hour      Day  Month               Day of Week
52         */4       *    *                   *

Example 8: Complex 2
Run 8, 22, and 47 minutes after the hour at 2AM and 2PM (e.g. 2:08AM, 2:22AM, 2:47AM, 2:08PM, 2:22PM, 2:47PM) of every third month.

SIMPLE INTERFACE
Minute     Hour      Day  Month               Day of Week
8, 22, 47  2AM, 2PM  All  Jan, Apr, Jul, Oct  All

ADVANCED INTERFACE
Minute     Hour      Day  Month               Day of Week
8,22,47    2,14      *    */3                 *
