
I develop high-load distributed systems in Java, and I'm trying to make developing distributed systems in web languages (such as PHP, Python and Ruby) as simple as it is in Java. If you have the same goal, please join!

PHP + Java, or In-Memory Cluster for PHP Developers

06.03.2014

Intro

[Image: Java + PHP]

In recent years I have been developing distributed systems in Java, and I use in-memory data grids (IMDGs) to build them. What is an IMDG? It is a durable, clustered in-memory cache that supports distributed data processing and can write data to persistent storage (for example, a relational database).
An IMDG consists of a set of 'caches', where a 'cache' is a distributed hash table (a key-value data structure).
But why are IMDGs not more popular? Because, unfortunately, there is no IMDG for web developers yet. Writing an IMDG in PHP or Python is not a good idea because such a cluster would not be fast enough, so if we need an IMDG for the Web, we have to use one written in Java.
Different IMDGs provide several kinds of API:
  • Native Java/C++/C# APIs. Not suitable for our purpose
  • REST API. This is not a solution either, because the API is quite limited and doesn't allow storing objects of custom types (for example, our domain objects). The HTTP protocol also adds some overhead
  • A binary protocol for external clients. This protocol is specific to each IMDG, and it is tricky to make it work with your application and your custom type system (it forces you to write Java code)

As we can see, none of these options lets us simply describe our custom type system and start working with an IMDG, so we need to write our own software that uses an underlying in-memory data grid and provides an easy-to-use API for web applications.

Sproot Grid

I named it Sproot Grid. It is built on top of Infinispan, an open source IMDG from JBoss (part of Red Hat), and uses Apache Thrift as the protocol for serialization and remote method invocation. Thrift is open source and was originally developed at Facebook for cross-language services; Cassandra, for example, has used it for its client API.

Sproot Grid allows you to store built-in PHP types as well as custom domain types. Supporting custom types was a challenge, because there are three ways to handle them:

  1. Work with the fields of an object using reflection. This is slow
  2. Write custom mapping and serialization/deserialization code for both the client and server sides, and write the configuration for the underlying Java IMDG. This code runs fast, but it takes effort to write, and the PHP developer has to write both Java and PHP code
  3. Generate all the necessary mapping and serialization/deserialization code, compile the Java part into a .jar library, and generate the configuration for the underlying Java IMDG. This is the most convenient approach for users (PHP application developers) and the most difficult for the developers of the IMDG

So I decided to use the third approach: you just need to write a configuration file (which consists of a description of your data types and a simple cluster configuration) and run the build script. After that you can store your domain objects in the cluster and retrieve them. Please see the Getting Started page on the project wiki.

It is published under the MIT license. Source code, wiki, distribution download page.

You can find a detailed API description on the wiki page, but briefly the API is the following:

  • get($cacheName, $key)
  • getAll($cacheName, array $keys)
  • cacheSize($cacheName)
  • cacheKeySet($cacheName)
  • containsKey($cacheName, $key)
  • search($cacheName, $fieldName, $searchWord)
  • remove($cacheName, $key)
  • removeAll($cacheName, array $keys)
  • put($cacheName, $key, $domainObject)
  • putAll($cacheName, array $domainObjects)
  • clearCache($cacheName)

What is included in Sproot Grid 1.0.0:

  • horizontal scalability and true clustering with balancing of data between cluster nodes
  • API for PHP applications
  • storage of built-in PHP types and your own custom types
  • indexing by field and search over this index
Here "horizontal scalability" means the ability to add or remove cluster nodes on the fly, and "true clustering" means that you don't have to know which node your data is actually stored on: the application just calls get(...) or put(...) on any node and the data is returned to the caller. Clustering also means that you can configure data redundancy in terms of "how many nodes can fail simultaneously without data loss": any object you put in the cluster is backed up on other cluster nodes. When a new node joins the cluster, the cluster starts re-balancing and the new node receives its portion of the data.
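
To make the idea of redundancy more concrete, here is a minimal Java sketch of how a key could be mapped to a primary node plus backup nodes. It is only an illustration under my own assumptions: the class OwnerPlacementSketch, the method ownersOf and the node names are hypothetical, and the simple modulo placement is not the algorithm Sproot Grid or Infinispan actually uses (production grids typically rely on consistent hashing, which moves much less data when nodes join or leave).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified placement sketch (hypothetical, NOT the algorithm used by Sproot Grid).
class OwnerPlacementSketch {

    // Returns the nodes holding a copy of the key: the primary first, then the backups.
    static List<String> ownersOf(String key, List<String> nodes, int numOwners) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("cluster has no nodes");
        }
        // A key cannot have more copies than there are nodes.
        int copies = Math.min(numOwners, nodes.size());
        // floorMod keeps the index non-negative even for negative hash codes.
        int primary = Math.floorMod(key.hashCode(), nodes.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            owners.add(nodes.get((primary + i) % nodes.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-1", "node-2", "node-3");
        // Two copies per key: any single node can fail without losing the entry.
        System.out.println("user:42 is stored on " + ownersOf("user:42", nodes, 2));
    }
}
```

With numOwners = 2 every entry lives on two different nodes, so the cluster survives the loss of any one node; the caller never needs to know which nodes those are.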

What is planned for Sproot Grid 1.1.0:
  • read-through
  • write-through/write-behind

Read-Through and Write-Behind caching

Keeping the data in a cache up to date is not trivial if you use the classic caching pattern:

Read data

  • the application checks the cache. If the data is present, the application returns the response
  • if the data is missing or expired, the application reads it from the database, puts it in the cache, and then returns the response

Write data

  • the application writes the data to the database
  • the application puts the data in the cache

Delete data

  • the application removes the data from the database
  • the application removes the data from the cache
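
The three flows above can be sketched in a few lines of Java. This is a minimal illustration, not code from Sproot Grid: the class CacheAsideSketch is hypothetical, both stores are plain in-memory maps standing in for a real cache and a real database, and expiration is not modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Classic pattern: the application itself keeps two stores in sync.
class CacheAsideSketch {
    final Map<String, String> cache = new HashMap<>();     // stand-in for the cache
    final Map<String, String> database = new HashMap<>();  // stand-in for the database

    String read(String key) {
        String value = cache.get(key);
        if (value == null) {               // cache miss: fall back to the database
            value = database.get(key);
            if (value != null) {
                cache.put(key, value);     // populate the cache for the next read
            }
        }
        return value;
    }

    void write(String key, String value) {
        database.put(key, value);          // first store
        cache.put(key, value);             // second store
    }

    void delete(String key) {
        database.remove(key);
        cache.remove(key);
    }

    public static void main(String[] args) {
        CacheAsideSketch app = new CacheAsideSketch();
        app.database.put("user:1", "Alice");   // row that exists only in the database
        System.out.println(app.read("user:1"));
        System.out.println("cached after read: " + app.cache.containsKey("user:1"));
    }
}
```

Note that every operation touches both stores explicitly, which is exactly the orchestration burden this pattern puts on the application.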

So in this pattern the application has to keep the same data up to date in two stores. Thus there are two relations: 1) 'application -> database', 2) 'application -> cache'. But what about a single chain, 'application -> cache -> database'? This configuration has several benefits:

  • The application is responsible for just one store (the cache), and the cache in turn is responsible for sending data to the DB, so orchestration becomes much simpler
  • The application doesn't have to keep the cache up to date, because all data goes through the cache
  • You can still control the number of objects in the cache by configuring an 'eviction policy'
  • Slow communication with the database moves behind the scenes, and the application interacts only with the fast cache cluster
  • If an object is missing from the cache, it can be read from the DB (see the 'Read-Through' section for details)
  • You can choose a sync or async 'cache -> DB' interaction mode for write operations (see the 'Write-Through' and 'Write-Behind' sections for details)

Read-Through

If the application calls the SprootClient->get('cache-name', 'someKey') method but the object is not in the cache, the cluster will try to find it in the underlying persistent storage (for example, a relational DB). If the object is found in the DB, it is put into the cache and returned to the application; otherwise 'null' is returned.

Thus 'read-through' mode keeps the data in the cache consistent with the persistent storage without additional effort on the application side, provided that all updates go through the cache.
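
A minimal Java sketch of the read-through idea follows. It is an assumption-laden illustration rather than the planned Sproot Grid implementation: the class ReadThroughSketch, the loader function and the isCached helper are hypothetical names, and the "database" in the demo is just a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Read-through: the cache, not the application, loads missing entries.
class ReadThroughSketch {
    private final Map<String, String> entries = new HashMap<>();
    // Hypothetical hook to the persistent storage; returns null if nothing is found.
    private final Function<String, String> loader;

    ReadThroughSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    String get(String key) {
        String value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);       // miss: ask the persistent storage
            if (value != null) {
                entries.put(key, value);     // remember it for later reads
            }
        }
        return value;                        // null when the storage has no such key
    }

    boolean isCached(String key) {
        return entries.containsKey(key);
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();  // stand-in for a real DB
        database.put("someKey", "someValue");
        ReadThroughSketch cache = new ReadThroughSketch(database::get);
        System.out.println(cache.get("someKey"));
        System.out.println("cached now: " + cache.isCached("someKey"));
    }
}
```

From the caller's point of view there is only one call, get(...); whether the value came from memory or from the storage is invisible.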

Write-Through

Updates can be processed in two ways; the first is called 'write-through'. If the application calls the SprootClient->put('cache-name', 'someKey', $entry) method and 'write-through' mode is configured, Sproot Grid writes the updates (put(...) or remove(...)) to the underlying DB (or other persistent storage) synchronously. The workflow is as follows:

  • The application calls put(...) or remove(...)
  • Sproot updates the data in the cache
  • Sproot writes the updates to the DB
  • Control returns to the application

At first glance there seems to be no benefit in this workflow, but that is not true.

There is no need to synchronize the data in the cache and in the DB manually: the application makes just one call to Sproot Grid instead of two subsequent calls (first to the DB and then to the cache). And the next call to get(...) will return the object from the cache.
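
The synchronous workflow can be sketched in Java as follows. This is a simplified illustration under my own assumptions: WriteThroughSketch is a hypothetical class, and a plain map stands in for the DB.

```java
import java.util.HashMap;
import java.util.Map;

// Write-through: one call from the application updates the cache and the DB.
class WriteThroughSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final Map<String, String> database;   // stand-in for a real DB

    WriteThroughSketch(Map<String, String> database) {
        this.database = database;
    }

    void put(String key, String value) {
        entries.put(key, value);     // update the cache
        database.put(key, value);    // write to the DB before returning
    }

    void remove(String key) {
        entries.remove(key);
        database.remove(key);
    }

    String get(String key) {
        return entries.get(key);     // served from memory, no DB round trip
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();
        WriteThroughSketch grid = new WriteThroughSketch(database);
        grid.put("someKey", "entry");
        System.out.println("in DB: " + database.get("someKey"));
        System.out.println("in cache: " + grid.get("someKey"));
    }
}
```

When put(...) returns, both stores already hold the new value, so the application never has to reconcile them itself.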

Write-Behind

The second way of processing updates is called 'write-behind'. If the application calls the SprootClient->put('cache-name', 'someKey', $entry) method and 'write-behind' mode is configured, Sproot Grid writes the updates (put(...) or remove(...)) to the underlying DB (or other persistent storage) asynchronously. The workflow is as follows:

  • The application calls put(...) or remove(...)
  • Sproot updates the data in the cache
  • Control returns to the application
  • Sproot writes the updates to the DB in the background

As we can see, this approach returns control to the application sooner than 'write-through' because the DB write is asynchronous. Sproot Grid collects a batch of updates over a configured delay (for example 100 milliseconds, 3 seconds, or 30 minutes) and then writes this batch to the DB.
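
The batching idea can be sketched in Java like this. It is a deliberately simplified, single-threaded illustration and not the planned Sproot Grid implementation: WriteBehindSketch and flush() are hypothetical names, a map stands in for the DB, and instead of a real background timer the demo calls flush() by hand at the point where the configured delay would expire.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Write-behind: updates hit the cache immediately and the DB later, in batches.
// Not thread-safe; a real implementation would need synchronization.
class WriteBehindSketch {
    private final Map<String, String> entries = new HashMap<>();
    // Updates waiting for the next batch; a null value marks a removal.
    private final Map<String, String> pending = new LinkedHashMap<>();
    private final Map<String, String> database;   // stand-in for a real DB

    WriteBehindSketch(Map<String, String> database) {
        this.database = database;
    }

    String get(String key) {
        return entries.get(key);
    }

    void put(String key, String value) {
        entries.put(key, value);     // visible to readers right away
        pending.put(key, value);     // the DB write is deferred
    }

    void remove(String key) {
        entries.remove(key);
        pending.put(key, null);
    }

    // A background timer would call this after the configured delay.
    // Returns the number of keys written in this batch.
    int flush() {
        int written = pending.size();
        for (Map.Entry<String, String> update : pending.entrySet()) {
            if (update.getValue() == null) {
                database.remove(update.getKey());
            } else {
                database.put(update.getKey(), update.getValue());
            }
        }
        pending.clear();
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();
        WriteBehindSketch grid = new WriteBehindSketch(database);
        grid.put("a", "1");
        grid.put("a", "2");          // overwrites the pending update for "a"
        grid.put("b", "3");
        System.out.println("DB before flush: " + database);
        System.out.println("keys written: " + grid.flush());
        System.out.println("DB after flush: " + database);
    }
}
```

Keeping pending updates keyed by entry means that several writes to the same key within one delay window reach the DB as a single write, which is one reason batching can reduce DB load.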

Benefits are the following:

  • The application doesn't have to wait until the updates are written to the DB
  • The application is not affected by DB failures (for example, if the DB is down or too slow), because Sproot Grid is responsible for writing updates to the DB, and these updates will be written when the DB comes back up
  • Data is available in the cache even if the DB is down

Since Sproot Grid is a cluster with configurable redundancy, data is not lost if a node goes down (within the configured redundancy level), so Sproot can serve as a durable and fast buffer between your application and the database as well as a standalone clustered cache.

Conclusion

If you need a clustered and scalable cache for a PHP application, Sproot Grid 1.0.0 is a good choice. If you need features like read-through/write-behind, wait for 1.1.0. Stay tuned!

Published at DZone with permission of its author, Alexey Olenev.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)