Using a Search Index

Cloudant Search is built on Lucene and lets you run more ad hoc queries over your data than primary and secondary indexes allow.

Provision the IBM Cloudant Service in Bluemix

If you do not already have the IBM Cloudant service provisioned in Bluemix, follow these steps to provision the service.

Replicate the sample database

You'll be working with a sample database in this tutorial. Follow these steps to replicate the sample database.

Review Index functions

Search indexes are defined by a JavaScript function. The function is run over all of your documents, in a similar manner to a view's map function, and defines the fields that your search can query.

A simple search function

function(doc){
  index("name", doc.name);
}

The function takes a single argument, the document, and calls the built-in index function to define an index on the name field.

Field names (the first argument to the index() function) cannot start with an underscore (_). If they do, the document will not be indexed.

Values can only be strings, booleans, or numbers (specifically 64-bit floating point). Notably, they cannot be objects, arrays, null, or undefined; if they are, the document will not be indexed.
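
The two rules above can be expressed as a small check. This is a minimal sketch for local experimentation, not part of Cloudant's API; the helper name canIndex is hypothetical.

```javascript
// Hypothetical helper: returns true when a (field name, value) pair satisfies
// the rules above. It is not a Cloudant built-in.
function canIndex(fieldName, value) {
  // Field names must not start with an underscore.
  const nameOk = typeof fieldName === "string" && !fieldName.startsWith("_");
  // Only strings, booleans and numbers are allowed. typeof reports "object"
  // for null and arrays and "undefined" for missing fields, so all are rejected.
  const valueOk = ["string", "boolean", "number"].includes(typeof value);
  return nameOk && valueOk;
}

console.log(canIndex("name", "kookaburra")); // true
console.log(canIndex("_private", "x"));      // false
console.log(canIndex("tags", ["a", "b"]));   // false
console.log(canIndex("diet", null));         // false
```

A check like this can be useful in unit tests for an index function before the design document is uploaded.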

Similar to views, the functions that define search indexes are stored in design documents, but under the key indexes. Under indexes, you define each search index in an object that contains the index function and an optional analyzer. Analyzers are described below; the default is standard.
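
Because the index function is stored as a string, it can be convenient to write it as ordinary JavaScript and serialize it when building the design document. The following is a sketch under assumptions: the design document name "_design/example" and the index name "byName" are hypothetical, and uploading the document is left out.

```javascript
// The index function is written as normal JavaScript and then serialized,
// because design documents store it as a string. It is never called here;
// index() only exists inside Cloudant.
const indexFn = function (doc) {
  index("name", doc.name);
};

// "_design/example" and "byName" are hypothetical names chosen for this sketch.
const designDoc = {
  _id: "_design/example",
  indexes: {
    byName: {
      analyzer: "standard", // optional; "standard" is the default
      index: indexFn.toString(),
    },
  },
};

console.log(JSON.stringify(designDoc, null, 2));
```

Writing the function as real code rather than a hand-escaped string lets an editor or linter catch syntax errors before the design document is saved.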

Query a search index

This is the design document that defines the search index in the animaldb database.

{
  "_id": "_design/views101",
  "_rev": "12-649b0e71ca89cdad5d66a4e07316726f",
  "indexes": {
    "animals": {
      "index": "function(doc){ index(\"default\", doc._id); }"
    }
  }
}

The API call below queries this search index, called animals, inside the views101 design document. The query does not specify a field (it just uses ?q=[query]), so Cloudant searches the default field, which the function above populates with the document _id. Because animal names are stored in the _id field, the default field is perfect for name searches, like ?q=kookaburra. Also try a search for "llama" or "elephant". Note, however, that you can always query by ID using the special _id field name.

Query: https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=kookaburra
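
When a query contains characters such as colons or spaces, it needs to be URL-encoded. The helper below is a minimal sketch of how the search URL might be assembled; the function name searchUrl and the account name "example" are placeholders, not part of any Cloudant library.

```javascript
// Builds a search URL from its parts. All argument values used below are
// placeholders for illustration.
function searchUrl(account, db, designDoc, indexName, query) {
  const path = [db, "_design", designDoc, "_search", indexName]
    .map((part) => encodeURIComponent(part))
    .join("/");
  return `https://${account}.cloudant.com/${path}?q=${encodeURIComponent(query)}`;
}

console.log(searchUrl("example", "animaldb", "views101", "animals", "kookaburra"));
// https://example.cloudant.com/animaldb/_design/views101/_search/animals?q=kookaburra

console.log(searchUrl("example", "animaldb", "views101", "animals", "class:bird AND diet:carnivore"));
// https://example.cloudant.com/animaldb/_design/views101/_search/animals?q=class%3Abird%20AND%20diet%3Acarnivore
```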

Review the Index Options

The built-in index function takes three arguments: the Lucene field name, the value for that field, and an optional options object.

Here's an example of a search index function.

function(doc){
  index("name", doc.name, {"store": true, "index": false});
}

The options object supports the keys listed below; the example above uses the two boolean keys store and index.

Option	Description	Values	Default
`store`	If `true`, the value will be returned in the search result; if `false`, the value will not be returned in the search result.	`true`, `false`	`false`
`index`	Whether the data is indexed. If set to `false`, the data cannot be used for searches, but it can still be retrieved from the index if `store` is set to `true`.	`true`, `false`	`true`
`facet`	Whether to enable faceting.	`true`, `false`	`false`
`boost`	To increase the relevance of the data being indexed in search results, supply a number greater than 1.0. Relevance will be adjusted by the given factor.	Any positive floating point number.	`1.0` (no boosting)

Calling the index function with both store and index set to false has no effect.

Attempting to index using a data field that does not exist will fail. To avoid this problem, use an appropriate guard clause.

The index function takes the value to index as its second parameter. However, if the corresponding data field does not exist in the document, an error occurs. The solution is to use a 'guard clause' that checks that the field exists, and contains the expected type of data, before attempting to index it.

For example, you might use the JavaScript typeof operator to determine the type of the data field:

if (typeof(doc.min_length) === 'number') {
  index("min_length", doc.min_length, {"store": "yes"});
}

If the field exists and has the expected type, typeof returns the expected type name, the guard clause test succeeds, and it is safe to call the index function. If the field does not exist, typeof returns "undefined", the test fails, and no attempt is made to index the field.
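
To see the guard clause in action outside Cloudant, you can run the index function against a few sample documents with a stand-in for index(). This is a sketch for local testing only: the stub below merely records calls and is an assumption of this example, and the sample documents are made up.

```javascript
// Stand-in for Cloudant's built-in index(); it only records each call so the
// behavior of the guard clause can be inspected locally.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options });
}

// Index function using the guard clause from the text.
function indexFn(doc) {
  if (typeof doc.min_length === "number") {
    index("min_length", doc.min_length, { store: "yes" });
  }
}

// Made-up sample documents: one valid, one missing the field, one wrong type.
indexFn({ _id: "kookaburra", min_length: 0.28 });
indexFn({ _id: "no-length" });
indexFn({ _id: "bad-type", min_length: "long" });

console.log(indexed.length);   // 1
console.log(indexed[0].value); // 0.28
```

Only the first document passes the guard, so only one call to index() is recorded.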

Review Analyzers

Analyzers define how to extract index terms from text, which you might need to control if, for example, your application needs to index Chinese. Here's the list of generic analyzers supported by Cloudant search. See further down for language-specific analyzers.

standard	This is the default analyzer and implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
email	Like `standard` but tries harder to match an email address as a complete token.
keyword	Input is not tokenized at all.
simple	Divides text at non-letters.
whitespace	Divides text at whitespace boundaries.
classic	The standard Lucene analyzer circa release 3.1. You'll know if you need it.

You can choose which analyzer is used by your index function by changing the index definition in the design document.

Defining an analyzer

"indexes": {
  "mysearch": {
    "analyzer": "whitespace",
    "index": "function(doc){ ... }"
  }
}

Note: Changing the analyzer causes the index to be rebuilt. Also note that queries against a given index are run with the same analyzer that is defined for that index.

Language-specific Analyzers

We provide a large number of analyzers for specific languages. These analyzers will omit very common words in the specific language, as these tend to make poor search queries and cause considerable index bloat. Many of these also perform stemming, where common word prefixes or suffixes are removed.

Per-Field Analyzer

Sometimes a single analyzer isn't enough. You can use the perfield analyzer to configure different analyzers for different field names.

Per-field analysis

"indexes": {
  "mysearch": {
    "analyzer": {
      "name": "perfield",
      "default": "english",
      "fields": {
        "spanish": "spanish",
        "german": "german"
      }
    },
    "index": "function(doc){ ... }"
  }
}

Stop words

You may want to define a set of words that do not get indexed. These are called stop words. You define stop words in the design document by turning the analyzer string into an object:

A simple stop words example

"indexes": {
  "mysearch": {
    "analyzer": {"name": "portuguese", "stopwords": ["foo", "bar", "baz"]},
    "index": "function(doc){ ... }"
  }
}

Note that the keyword, simple, and whitespace analyzers do not support stop words.

API options

As you probably noticed above, the search URL requires a q (or query) query string parameter. This is the query that is passed on to the search index. Search supports two data types: string and number. The data type is detected automatically. If you need to pass a number as a string, quote it, e.g. q="12".

The search URL can optionally take limit, include_docs, and stale (which behave the same as they do for primary and secondary indexes), as well as sort and bookmark.

Pagination and Sorting

Bookmarks allow you to efficiently skip through results you have already seen. All search results include a bookmark in their JSON response. By passing this value to the search URL via the bookmark query parameter, you will see the next page of results.
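
A bookmark-driven paging loop might look like the sketch below. It makes several assumptions: fetchPage is a caller-supplied function that performs one search request with the given bookmark and returns the parsed JSON response, an empty rows array is treated as the end of the results, and the loop is written synchronously for brevity where a real client would use asynchronous HTTP calls.

```javascript
// Collects rows from successive pages by feeding each response's bookmark
// into the next request. fetchPage is hypothetical and supplied by the caller.
function collectAllRows(fetchPage, maxPages = 100) {
  const rows = [];
  let bookmark; // undefined on the first request
  for (let page = 0; page < maxPages; page++) {
    const response = fetchPage(bookmark);
    // Assumption: a page with no rows means there is nothing left to read.
    if (!response.rows || response.rows.length === 0) break;
    rows.push(...response.rows);
    bookmark = response.bookmark;
  }
  return rows;
}

// Canned responses standing in for real search requests.
const pages = {
  start: { rows: [{ id: "a" }, { id: "b" }], bookmark: "g1" },
  g1: { rows: [{ id: "c" }], bookmark: "g2" },
  g2: { rows: [], bookmark: "g2" },
};
const all = collectAllRows((bookmark) => pages[bookmark || "start"]);
console.log(all.map((row) => row.id)); // [ 'a', 'b', 'c' ]
```

The maxPages cap is a safety net so that a misbehaving fetchPage cannot cause an endless loop.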

Search results can be sorted ascending or descending by any numeric or string field in the index. Sort order is set by the sort query parameter, which takes a JSON string or list as its value. If the field is a string field, you have to add <string> to the end of the field name. To sort by age, query your search index with ?sort="age"; to sort descending, use ?sort="-age". To sort by name, use ?sort="name<string>". Sorts can be applied to multiple fields; for instance, ?sort=["-age", "height"] sorts by age descending, then by height ascending.
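
Because the sort value is JSON, it contains quotes and brackets that need URL-encoding when placed in a query string. A minimal sketch, assuming a hypothetical helper named sortParam:

```javascript
// Serializes a sort specification (a string or an array of strings) as JSON
// and URL-encodes it for use in a query string.
function sortParam(sort) {
  return "sort=" + encodeURIComponent(JSON.stringify(sort));
}

console.log(sortParam("-age"));              // sort=%22-age%22
console.log(sortParam("name<string>"));      // sort=%22name%3Cstring%3E%22
console.log(sortParam(["-age", "height"]));  // sort=%5B%22-age%22%2C%22height%22%5D
```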

Sorting by Relevance

The default sort order (when you don't supply a sort parameter) is relevance: the highest-scoring matches are returned first. If you specify a sort order, matches are returned in that order, ignoring relevance. To include relevance in your sort order, use the special fields -<score> and <score>.

Sorting By Distance

In addition to sorting by indexed fields, you can sort by distance from a point chosen at query time. You will need to index two numeric fields (representing the longitude and latitude of whatever you're indexing):

function(doc) {
  index("mylon", doc.longitude);
  index("mylat", doc.latitude);
}

You can then query using the special <distance...> sort field, which takes five parameters:

longitude field name: The name of your longitude field ("mylon" in this example)
latitude field name: The name of your latitude field ("mylat" in this example)
longitude of origin: The longitude of the place you want to sort by distance from
latitude of origin: The latitude of the place you want to sort by distance from
units: The units to use ("km" or "mi" for kilometers and miles, respectively)

The distance itself is returned in the order field.

An example query to make this clear:

?sort="<distance,mylon,mylat,-0.14479689999996026,51.4964609,mi>"
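
The five parameters can be assembled programmatically. The helper below is a sketch; the function name distanceSort is hypothetical, and the coordinates in the usage line are illustrative values only.

```javascript
// Builds the <distance,...> sort field from its five parameters.
function distanceSort(lonField, latField, originLon, originLat, units) {
  if (units !== "km" && units !== "mi") {
    throw new Error('units must be "km" or "mi"');
  }
  return `<distance,${lonField},${latField},${originLon},${originLat},${units}>`;
}

// The sort parameter takes a JSON string, so the field is wrapped in quotes.
const sort = distanceSort("mylon", "mylat", -0.1448, 51.4965, "mi");
console.log(JSON.stringify(sort)); // "<distance,mylon,mylat,-0.1448,51.4965,mi>"
```

Remember to URL-encode the resulting JSON string before appending it to the search URL.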

You can combine sorting by distance with a bounding box query to perform simple geo operations.

Review Query Syntax

The Cloudant search query syntax is based on the Lucene syntax. ▼More

Search queries take the form of name:value (unless the name is omitted, in which case they hit the default field as we demonstrated in the first example, above).

Queries over multiple fields can be logically combined, and groups and fields can be further grouped. The available logical operators are AND, +, OR, NOT and -, and they are case sensitive. Range queries can run over strings or numbers.

For a fuzzy search, append ~ to a term to find terms similar to it; for instance, look~ will find the terms book and took.

You can also increase the importance of a search term by using the boost character ^. This makes matches containing the term more relevant, e.g. cloudant "data layer"^4 will make results containing "data layer" 4 times more relevant. The default boost value is 1. Boost values must be positive, but can be less than 1 (e.g. 0.5 to reduce importance).

Wildcard searches are supported, for both single (?) and multiple (*) character searches. dat? would match date and data; dat* would match date, data, database, dates, etc. Wildcards must come after a search term; you cannot do a query like *base.

Result sets from searches are limited to 200 rows, and return 25 rows by default. The number of rows returned can be changed via the limit parameter. The response contains a bookmark. If the bookmark is passed back as a URL parameter you'll skip through the rows you've already seen and get the next set of results.

The following characters require escaping if you want to search on them:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Escape these with a preceding backslash character.
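
For user-supplied search terms, the escaping can be automated. The helper below is a sketch rather than an official utility: the name escapeQueryTerm is hypothetical, and it escapes every & and | individually, which is a simplification that also covers && and ||.

```javascript
// Prefixes each special character with a backslash. Intended for a single
// user-supplied term, not for a whole query that contains operators.
function escapeQueryTerm(term) {
  return term.replace(/[+\-!(){}\[\]^"~*?:\\\/&|]/g, "\\$&");
}

console.log(escapeQueryTerm("AC/DC"));   // AC\/DC
console.log(escapeQueryTerm("1+1"));     // 1\+1
console.log(escapeQueryTerm("llama"));   // llama
```

Apply the function only to the value part of a name:value pair; escaping a complete query would also neutralize the operators and wildcards you intend to use.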

The animaldb database contains a design document that, amongst other things, defines a search index over the animal name, diet, minimum length, Latin name and class.

function(doc){
  index("default", doc._id);
  if(doc.min_length){
    index("min_length", doc.min_length, {"store": "yes"});
  }
  if(doc.diet){
    index("diet", doc.diet, {"store": "yes"});
  }
  if (doc.latin_name){
    index("latin_name", doc.latin_name, {"store": "yes"});
  }
  if (doc.class){
    index("class", doc.class, {"store": "yes"});
  }
}

With this index you can run any of these queries.

Desired result	Query
Birds	`class:bird`
Animals that begin with the letter "l"	`l*`
Carnivorous birds	`class:bird AND diet:carnivore`
Herbivores that start with letter "l"	`l* AND diet:herbivore`
Medium-sized herbivores	`min_length:[1 TO 3] AND diet:herbivore`
Herbivores that are 2m long or less	`diet:herbivore AND min_length:[-Infinity TO 2]`
Mammals that are at least 1.5m long	`class:mammal AND min_length:[1.5 TO Infinity]`
Find "Meles meles"	`latin_name:"Meles meles"`
Mammals that are herbivores or omnivores	`diet:(herbivore OR omnivore) AND class:mammal`

Query: https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=class:bird

Grouping Results

In addition to basic searching, you can also group results by common values of a chosen field using the group_field parameter. For full details, see Docs.

Faceted Search

Cloudant Search also supports faceted searching, which allows you to discover aggregate information about all your matches quickly and easily. You can even match all documents (using the special ?q=*:* query syntax) and use the returned facets to refine your query.

Indexing a facet is straightforward, and facet values can be strings or numbers:

function(doc) {
  index("type", doc.type, {"facet": true});
  index("price", doc.price, {"facet": true});
}

Once indexed, you can find out how many documents you have of any string facet with the counts= parameter, in addition to any query string you like. Example output for ?q=*:*&counts=["type"] follows:

{"total_rows":100000, "bookmark":"g...", "rows":[...],
 "counts":{"type":{"sofa":10.0, "chair":100.0}}
}
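
Once a response like the one above has been parsed, the counts can be turned into a ranked list, for example to render facet links in a user interface. A minimal sketch, assuming a hypothetical helper named facetCounts and a response object shaped like the example output:

```javascript
// Returns [value, count] pairs for one facet field, largest count first.
// Returns an empty array when the response has no counts for that field.
function facetCounts(response, field) {
  const counts = (response.counts && response.counts[field]) || {};
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

// Sample data copied from the example output above.
const response = {
  total_rows: 100000,
  counts: { type: { sofa: 10.0, chair: 100.0 } },
};
console.log(facetCounts(response, "type")); // [ [ 'chair', 100 ], [ 'sofa', 10 ] ]
```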

You can also perform range facet queries on numeric facets using the ranges= parameter. For example:

?q=*:*&ranges={"price":{"cheap":"[0 TO 100]","expensive":"{100 TO Infinity}"}}

The range facet syntax reuses the standard Lucene syntax for ranges (inclusive range queries are denoted by square brackets, exclusive range queries are denoted by curly brackets).

This will return output like:

"ranges":{"price":{"cheap":101.0,"expensive":99899.0}}

Using POST

Some queries can get very long, or can be difficult to URL-encode correctly. In these cases you can use a POST instead:

{"query":"*:*", "limit":100}

The parameters are the same as before, but you no longer have to worry about URL length limits or URL escaping.
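
The request body can be built with JSON.stringify. The sketch below only constructs the request options and does not send anything; the helper name buildSearchRequest is hypothetical, and it is assumed that the result would be passed to an HTTP client (for example fetch) together with the same search URL used for GET.

```javascript
// Builds the options for a POST search request. Nothing is sent here.
function buildSearchRequest(query, options = {}) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, ...options }),
  };
}

const request = buildSearchRequest("*:*", { limit: 100 });
console.log(request.body); // {"query":"*:*","limit":100}
```

Because the query travels in the JSON body, characters such as quotes, colons, and brackets need no URL escaping.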

Example applications

To demonstrate the functionality of search we've pulled together a couple of example applications.

Find more videos and tutorials in the Learning Center.