Using a Secondary Index
Ideal for routine queries, secondary indexes use MapReduce to build indexes over large amounts of data.
If you learn better by seeing a demonstration, watch these videos first:
- How does MapReduce work?
- Build and query a secondary index
- Use advanced techniques with a secondary index
Provision the IBM Cloudant Service in Bluemix
- Visit IBM Bluemix at http://console.ng.bluemix.net.
- If you don't have a Bluemix account, click Sign Up. Complete the fields on the form, and click Create Account.
- If you have a Bluemix account, click Log In. Provide your IBMid and password, and click Log In.
- In the top navigation bar, click Catalog.
- In the left navigator, under Services, check Data and Analytics.
- Click Cloudant NOSQL DB.
- Accept the default service, or provide a different name, and click Create. This brings you to the IBM Cloudant Bluemix Service Launch Page.
- Click Service Credentials. Note your username, password, and host name. You will be able to use the URL (which also passes your username and password) to access your database from a browser.
- Click Manage.
- Click Launch to load the Cloudant dashboard.
Replicate the sample database
- From the Cloudant dashboard, replicate the https://examples.cloudant.com/animaldb remote database into your account. If you need help creating the replication request, refer to the Replication video and Tutorial on the Create a Replication Job page.
- After replicating the animaldb database, in a browser, access the database to see a list of documents.
https://<account>.cloudant.com/animaldb/_all_docs?include_docs=true
Write a secondary index
Secondary indexes, or views, are defined by a map function, which pulls data out of your documents, and an optional reduce function, which aggregates the data emitted by the map.
These functions are written in JavaScript and held in "design documents", special documents that the database knows contain these (and other) functions. We'll go into more detail about design documents in another tutorial; for now, just think of them as documents that define your secondary indexes.
A sample design document with MapReduce functions
{
  "_id": "_design/name",
  "views": {
    "view1": {
      "map": "function(doc){emit(doc.field, 1)}",
      "reduce": "function(keys, values, rereduce){return sum(values)}"
    }
  }
}
The naming convention for design documents is that the name follows _design/ in the _id. This code defines view1 for the design document name. Design documents can contain multiple views; each is added to the views object.
A map function is required for a view; a reduce function is optional.
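Because a design document stores its functions as strings, it can be convenient to write them as ordinary JavaScript and serialise them afterwards. The following is a minimal sketch of that idea, not an official Cloudant tool: buildDesignDoc is a hypothetical helper, and the functions are only converted to text here, never called (emit and sum exist only inside the database).

```javascript
// Written as anonymous function expressions so that toString() yields
// source text in the same shape as the sample design document.
const mapByField = function (doc) {
  emit(doc.field, 1);
};

const sumReduce = function (keys, values, rereduce) {
  return sum(values);
};

// Hypothetical helper: turns functions into strings and assembles
// the design document structure described above.
function buildDesignDoc(name, views) {
  const designDoc = { _id: "_design/" + name, views: {} };
  for (const [viewName, fns] of Object.entries(views)) {
    const view = { map: fns.map.toString() };
    if (fns.reduce) {
      // A reduce that is already a string is kept unchanged.
      view.reduce = fns.reduce.toString();
    }
    designDoc.views[viewName] = view;
  }
  return designDoc;
}

const designDoc = buildDesignDoc("name", {
  view1: { map: mapByField, reduce: sumReduce },
});

console.log(JSON.stringify(designDoc, null, 2));
```

The resulting object can be saved to the database like any other document.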
A sample Cloudant API call
Here's what an API call to this sample view would look like, where [username] is your username and [db_name] is the name of your database:
https://[username].cloudant.com/[db_name]/_design/name/_view/view1
Review Map functions
This index emits the animal's diet as the key and 1 as the value.
function(doc) {
  if (doc.diet) {
    emit(doc.diet, 1);
  }
}
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/diet
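To see what this map function produces without a database, you can run it locally against a few documents. The sketch below makes several assumptions: the sample documents are made up, emit is a local stand-in for the function the database supplies, and the sort is a plain string comparison that only approximates the database's ordering.

```javascript
// Made-up sample documents, loosely modelled on animaldb.
const sampleDocs = [
  { _id: "zebra", diet: "herbivore" },
  { _id: "aardvark", diet: "omnivore" },
  { _id: "snipe" }, // no diet field, so the map emits nothing
  { _id: "giraffe", diet: "herbivore" },
];

// Local stand-in for the emit() that the database provides to map functions.
const emittedRows = [];
let currentDocId = null;
function emit(key, value) {
  emittedRows.push({ id: currentDocId, key: key, value: value });
}

// The map function from the text, unchanged.
const mapFn = function (doc) {
  if (doc.diet) {
    emit(doc.diet, 1);
  }
};

for (const doc of sampleDocs) {
  currentDocId = doc._id;
  mapFn(doc);
}

// Order rows by key, then by document id. This plain string comparison
// only approximates the database's own collation rules.
const compare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
emittedRows.sort((a, b) => compare(a.key, b.key) || compare(a.id, b.id));

console.log(emittedRows);
```

Three rows come out: two with the key herbivore and one with the key omnivore. The document without a diet field contributes nothing.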
Complex keys
A view's key can be any valid JSON data structure. We'll cover why this is particularly useful in the API section below; for now, it's useful to know that lists (arrays) and dictionaries (objects) can be emitted and that they sort after numbers and strings.
This index emits the class and diet as a complex key and 1 as the value.
function(doc) {
  if (doc.class && doc.diet) {
    emit([doc.class, doc.diet], 1);
  }
}
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/complex_count?reduce=false
Review Reduce functions
Let's say we want to sum all the values the map function emitted. That operation is done in the reduce function. Reduce functions are called with three parameters: keys, values, and rereduce. keys is a list of keys as emitted by the map, or null; values is a list of values, one for each element in keys; and rereduce is true or false.
The map function emits the animal's diet as the key and 1 as the value.
function(doc) {
  if (doc.diet) {
    emit(doc.diet, 1);
  }
}
A simplistic reduce function. This reduce function should return the number of rows, but it is broken. Can you see how?
function (keys, values, rereduce) {
  return values.length;
}
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/diet_jscount
There may be cases where you want only the results of the map function, even though you've added a reduce function to your view (that is, you don't want a reduced result). You don't need to write another view for that. Add reduce=false to the query to turn off the reduce function. (Try it on the query above.)
ReReduce
One common source of confusion when writing a reduce function is dealing with the rereduce=true case. When the view is built, the database arbitrarily divides the documents into batches to process. It then merges the results of these batches to form the complete view result. When the database does this merging, it calls the reduce function with rereduce=true. This means the database calls the function with the output of intermediate runs of the reduce function.
When writing reduce functions, you need to take the rereduce case into account. The example above didn't, which is why it is broken. Well done if you spotted that! Let's look at the code in more detail:
function (keys, values, rereduce) {
  return values.length;
}
When rereduce=false, the reduce function might be called with:
- keys:
[[key1, idA], [key1, idB], [key1, idC], [key2, idA], [key2, idD], [key3, idA]]
- values:
[key1value1, key1value2, key1value3, key2value1, key2value2, key3value1]
The above function would correctly return 6 (the length of the values array).
In the rereduce=true case, the function is called with an array of counts from previous invocations:
- keys:
null
- values:
[6, 3, 7]
and will return 3, which is not the correct count; it should be 6 + 3 + 7 = 16.
The function above would be reasonable for the rereduce=false invocation but is incorrect when rereduce is true. The reduce function needs to explicitly take into account the times it is called with the result of a previous reduce:
function (keys, values, rereduce) {
  if (rereduce) {
    // Get an array of counts, count == sum
    return sum(values);
  } else {
    // Get a list of values, count == length
    return values.length;
  }
}
You'll get the same result in the rereduce=false case, but in the rereduce=true case you'll correctly return the sum of the values.
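The difference between the two reduce functions can be checked locally by imitating the batching described above. This is a rough sketch under stated assumptions: simulate and makeBatch are hypothetical helpers, sum is a local stand-in for the helper available inside the database, and the batch sizes (6, 3, 7) are the ones used in the example.

```javascript
// Local stand-in for the sum() helper available to reduce functions.
function sum(values) {
  return values.reduce((total, v) => total + v, 0);
}

// The broken reduce: ignores the rereduce flag.
const brokenCount = function (keys, values, rereduce) {
  return values.length;
};

// The corrected reduce: sums partial counts when rereducing.
const fixedCount = function (keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  }
  return values.length;
};

// Hypothetical helper: a batch of `size` map rows with value 1.
function makeBatch(size) {
  return Array.from({ length: size }, (_, i) => ({
    key: "key" + i,
    id: "doc" + i,
    value: 1,
  }));
}

// Hypothetical driver: reduce each batch with rereduce=false, then
// merge the partial results with a single rereduce=true call.
function simulate(reduceFn, batches) {
  const partials = batches.map((batch) =>
    reduceFn(
      batch.map((row) => [row.key, row.id]),
      batch.map((row) => row.value),
      false
    )
  );
  return reduceFn(null, partials, true);
}

const batches = [makeBatch(6), makeBatch(3), makeBatch(7)];
console.log(simulate(brokenCount, batches)); // 3
console.log(simulate(fixedCount, batches)); // 16
```

The real database may rereduce in several stages rather than one, but a single merge step is enough to expose the bug.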
Built-in reduces
While you can define your own reduce functions, your reduce will often be doing a simple count or sum operation. There are a handful of built-in reduce functions: _sum, _count, and _stats. If you can use these functions, you should. They're faster than a JavaScript reduce (since they avoid serialisation between Erlang and JavaScript) and are very well tested.
_sum
- Produces the sum of all values for a key; values must be numeric
_count
- Produces the row count for a given key; values can be any valid JSON
_stats
- Produces a JSON structure containing sum, count, min, max, and sum of squares; values must be numeric
To use a built-in reduce, just put its name in place of the JavaScript reduce function inside your view.
This map function emits the animal's diet as the key and the animal's Latin name as the value.
function(doc) {
  if (doc.diet && doc.latin_name) {
    emit(doc.diet, doc.latin_name);
  }
}
This built-in reduce counts the number of rows emitted by the map function. The rows can have any value, unlike _sum, which requires the value to be a number.
_count
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/diet_count
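To make the three built-in reduces concrete, here is a plain JavaScript approximation of what each one computes over a single list of values. It is only an illustration: the real built-ins run inside the database rather than in JavaScript, this sketch ignores the rereduce step, and the field name sumsqr is an assumption about the _stats response that is worth checking against the Cloudant documentation.

```javascript
// Illustrative equivalents only; not how the database implements them.
const builtinSketch = {
  // _sum: add up numeric values.
  _sum: (values) => values.reduce((total, v) => total + v, 0),

  // _count: count rows, whatever their values are.
  _count: (values) => values.length,

  // _stats: summary statistics over numeric values.
  _stats: (values) => ({
    sum: values.reduce((total, v) => total + v, 0),
    count: values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    sumsqr: values.reduce((total, v) => total + v * v, 0),
  }),
};

const numericValues = [3, 5, 4];
console.log(builtinSketch._sum(numericValues)); // 12
console.log(builtinSketch._count(numericValues)); // 3
console.log(builtinSketch._stats(numericValues));
// { sum: 12, count: 3, min: 3, max: 5, sumsqr: 50 }

// _count does not care about the value type.
console.log(builtinSketch._count(["a", null, { any: "json" }])); // 3
```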
Review the API Options
limit, skip, slice, include_docs, and query for a specific key.
limit & skip
This map function emits the Latin name as the key and the length of that name as the value.
function(doc) {
  if (doc.latin_name) {
    emit(doc.latin_name, doc.latin_name.length);
  }
}
This API call will limit the results to 2 and skip over the first 3.
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/latin_name?limit=2&skip=3
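When you build query URLs like this in code, it helps to let the platform handle the escaping. The sketch below uses the standard URL class; viewUrl is a hypothetical helper, myaccount is a placeholder account name, and the JSON encoding of key, startkey, and endkey reflects my understanding that these parameters must be valid JSON (so a string key needs its quotes).

```javascript
// Parameters whose values are assumed to need JSON encoding.
const JSON_PARAMS = ["key", "startkey", "endkey"];

// Hypothetical helper: builds a view query URL from its parts.
function viewUrl(account, db, designDoc, view, options = {}) {
  const url = new URL(
    `https://${account}.cloudant.com/${db}/_design/${designDoc}/_view/${view}`
  );
  for (const [name, value] of Object.entries(options)) {
    const text = JSON_PARAMS.includes(name)
      ? JSON.stringify(value)
      : String(value);
    url.searchParams.set(name, text);
  }
  return url.toString();
}

console.log(
  viewUrl("myaccount", "animaldb", "views101", "latin_name", {
    limit: 2,
    skip: 3,
  })
);
// https://myaccount.cloudant.com/animaldb/_design/views101/_view/latin_name?limit=2&skip=3

console.log(
  viewUrl("myaccount", "animaldb", "views101", "diet", { key: "herbivore" })
);
// https://myaccount.cloudant.com/animaldb/_design/views101/_view/diet?key=%22herbivore%22
```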
stale=ok
This code emits the Latin name as the key, and the length of that name as the value.
function(doc) {
  if (doc.latin_name) {
    emit(doc.latin_name, doc.latin_name.length);
  }
}
Pass the stale=ok parameter to indicate that you'd rather have low-latency responses than a completely up-to-date index. If you omit this parameter from your queries, there may be times when you or your users have to wait for indexing to complete.
Because we regularly update your views for you, most developers building user-facing applications on Cloudant choose the stale=ok parameter for the best low-latency performance.
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/latin_name?stale=ok
reduce=false
If a reduce function is defined for a view, that function will have been applied to the view result. As already mentioned, you can query a view without the reduce step by passing ?reduce=false in the query.
The map function emits the animal's diet as the key and the Latin name as the value.
function(doc) {
  if (doc.diet && doc.latin_name) {
    emit(doc.diet, doc.latin_name);
  }
}
This built-in reduce counts the number of rows emitted by the map function but is disabled by querying the view with ?reduce=false.
_count
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/diet_count?reduce=false
group=true
In the reduce examples above, the group parameter was omitted, which generated a single result over all keys. If you want to return results per key, use group=true. group=true is invalid for a map-only or reduce=false view; you will get an error if you try to group a non-reduced view.
The map function emits the animal's diet as the key and the Latin name as the value.
function(doc) {
  if (doc.diet && doc.latin_name) {
    emit(doc.diet, doc.latin_name);
  }
}
This built-in reduce counts the number of rows emitted by the map function.
_count
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/diet_count?group=true
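The effect of group=true on a counting reduce can be imitated locally. This is a simplified sketch, not the database's algorithm: countRows is a hypothetical helper, the rows are made-up sample data assumed to be already sorted by key, and the ungrouped result uses a null key, which is how I understand a reduce over all keys to be reported.

```javascript
// Made-up map output: diet as key, Latin name as value.
const mapRows = [
  { key: "carnivore", value: "Lynx lynx" },
  { key: "herbivore", value: "Giraffa camelopardalis" },
  { key: "herbivore", value: "Loxodonta africana" },
  { key: "omnivore", value: "Gallinago gallinago" },
];

// Hypothetical helper: count rows overall, or per distinct key.
function countRows(rows, group) {
  if (!group) {
    // Without grouping, one result covers every row.
    return [{ key: null, value: rows.length }];
  }
  const counts = new Map(); // keeps keys in first-seen order
  for (const row of rows) {
    counts.set(row.key, (counts.get(row.key) || 0) + 1);
  }
  return Array.from(counts, ([key, value]) => ({ key, value }));
}

console.log(countRows(mapRows, false));
// [ { key: null, value: 4 } ]

console.log(countRows(mapRows, true));
// [ { key: 'carnivore', value: 1 },
//   { key: 'herbivore', value: 2 },
//   { key: 'omnivore', value: 1 } ]
```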
group_level
If you have a complex key, you can query the view at different group_level values. This means the reduce result can be returned at different granularities. This is very powerful for reporting on time-series data; the same view can answer queries about yearly activity or per-second activity. If you query with group_level equal to or higher than the length of your key (that is, the number of values in your complex key), you will get the same response as querying with group=true. The keys emitted by a view do not all need to be the same length.
function(doc) {
  if (doc.latin_name) {
    emit([doc.class, doc.diet, doc.latin_name], doc.latin_name.length);
  }
}
This built-in reduce counts the number of rows emitted by the map function.
_count
Query: https://[username].cloudant.com/animaldb/_design/views101/_view/complex_latin_name_count?group_level=3
Try changing the group level in the URL above. You should initially see results for the full key (the view is queried with group_level=3), but if you change that to group_level=2 or group_level=1, you should see the number of animals that match the key at that group level.
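One way to picture group_level is as truncating each complex key to its first N elements before counting. The sketch below imitates that locally for levels 1 and above. It is an approximation under stated assumptions: countByGroupLevel is a hypothetical helper, the rows are made-up sample data already sorted by key, and the real database does this work inside its index rather than over a list in memory.

```javascript
// Made-up map output with [class, diet, latin_name] keys.
const complexRows = [
  { key: ["bird", "omnivore", "Gallinago gallinago"], value: 19 },
  { key: ["mammal", "carnivore", "Lynx lynx"], value: 9 },
  { key: ["mammal", "herbivore", "Giraffa camelopardalis"], value: 22 },
  { key: ["mammal", "herbivore", "Loxodonta africana"], value: 18 },
];

// Hypothetical helper: count rows per key prefix of length `level`.
function countByGroupLevel(rows, level) {
  const counts = new Map(); // keeps prefixes in first-seen order
  for (const row of rows) {
    // Array keys are cut to `level` elements; other keys are kept whole.
    const prefix = Array.isArray(row.key) ? row.key.slice(0, level) : row.key;
    const id = JSON.stringify(prefix);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return Array.from(counts, ([id, value]) => ({ key: JSON.parse(id), value }));
}

console.log(countByGroupLevel(complexRows, 1));
// [ { key: [ 'bird' ], value: 1 }, { key: [ 'mammal' ], value: 3 } ]

console.log(countByGroupLevel(complexRows, 2));
// [ { key: [ 'bird', 'omnivore' ], value: 1 },
//   { key: [ 'mammal', 'carnivore' ], value: 1 },
//   { key: [ 'mammal', 'herbivore' ], value: 2 } ]

console.log(countByGroupLevel(complexRows, 3).length); // 4
```

At level 3 every key is distinct, so each row forms its own group.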
Views provide a powerful way to inspect your data, beyond basic key:value lookups and range queries over _all_docs. Building these secondary indexes incrementally allows for rapid analysis of your data as it streams into the database.
While views are ideal for routine queries, they are not well suited to ad hoc inspection of the data. For this, Cloudant has developed a search tool that allows complex, ad hoc queries over your dataset.
Find more videos and tutorials in the Learning Center.