Using a Search Index
Cloudant's search is built upon Lucene and allows you to do more ad hoc queries over your data than can be done with primary and secondary indexes.
Provision the IBM Cloudant Service in Bluemix
- Visit IBM Bluemix at http://console.ng.bluemix.net.
- If you don't have a Bluemix account, click Sign Up. Complete the fields on the form, and click Create Account.
- If you have a Bluemix account, click Log In. Provide your IBMid and password, and click Log In.
- In the top navigation bar, click Catalog.
- In the left navigator, under Services, check Data and Analytics.
- Click Cloudant NOSQL DB.
- Accept the default service, or provide a different name, and click Create. This brings you to the IBM Cloudant Bluemix Service Launch Page.
- Click Service Credentials. Note your username, password, and host name. You will be able to use the URL (which also passes your username and password) to access your database from a browser.
- Click Manage.
- Click Launch to load the Cloudant dashboard.
Replicate the sample database
- From the Cloudant dashboard, replicate the https://examples.cloudant.com/animaldb remote database into your account. If you need help creating the replication request, refer to the Replication video and Tutorial on the Create a Replication Job page.
- After replicating the animaldb database, in a browser, access the database to see a list of documents.
https://<account>.cloudant.com/animaldb/_all_docs?include_docs=true
Review Index functions
A simple search function
function(doc){
index("name", doc.name);
}
The function takes a single argument, the document, and calls the built-in index function to define an index on the name field.
Field names (the first argument to the index() function) cannot start with an underscore (_). If they do, the document will not be indexed.
Values can only be strings, booleans or numbers (specifically 64-bit floating point). Notably, they cannot be objects, arrays, null or undefined; if they are, the document will not be indexed.
Similar to views, the functions that define search indexes are stored in design documents, but under the key indexes. Under indexes you define each search index in an object containing the index function and an optional analyzer. Details on the analyzer are below; the default is standard.
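As a rough sketch of how such a design document could be assembled, the snippet below serializes an index function into the indexes key. The design document ID `_design/example` and the index name `byName` are hypothetical names chosen for illustration; they do not exist in animaldb.

```javascript
// Index function as it would run inside Cloudant; index() is supplied by the server.
function indexByName(doc) {
  if (typeof doc.name === "string") {
    index("name", doc.name);
  }
}

// Design documents store the function as a string under "indexes".
// "_design/example" and "byName" are hypothetical names for this sketch.
const designDoc = {
  _id: "_design/example",
  indexes: {
    byName: {
      analyzer: "standard", // optional; "standard" is the default
      index: indexByName.toString()
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```

The function is never called locally here; it is only converted to its source text, which is the form the design document expects.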
Query a search index
This is the design document that contains the search index function in the animaldb database.
{
"_id": "_design/views101",
"_rev": "12-649b0e71ca89cdad5d66a4e07316726f",
"indexes": {
"animals": {
"index": "function(doc){ index(\"default\", doc._id); }"
}
}
}
The API call below hits this search index, called animals, inside the views101 design document. As you can see, we're not specifying a field for the query (we're just using ?q=[query]), so Cloudant uses the default field, which, as specified above, indexes the document _id. Because animal names are stored in the _id field, the default search index is perfect for name searches, like ?q=kookaburra. Also try a search for "llama" or "elephant". Note, however, that you can always query by id using the special _id field name.
Query: https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=kookaburra
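The query URL above can also be built programmatically. The following is a minimal sketch that URL-encodes each parameter; the account name `myaccount` is a placeholder and `searchUrl` is a hypothetical helper, not part of any Cloudant library.

```javascript
// Build a search URL for a Cloudant search index.
// "myaccount" below is a placeholder, not a real account.
function searchUrl(account, db, ddoc, indexName, params) {
  const query = Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://${account}.cloudant.com/${db}/_design/${ddoc}/_search/${indexName}?${query}`;
}

const url = searchUrl("myaccount", "animaldb", "views101", "animals", { q: "kookaburra" });
console.log(url);
// https://myaccount.cloudant.com/animaldb/_design/views101/_search/animals?q=kookaburra
```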
Review the Index Options
The index function takes three arguments: the Lucene field, the value for that field and an optional options object.
Here's an example of a search index function.
function(doc){
index("name", doc.name, {"store": true, "index": false});
}
The options object supports the boolean keys store, index and facet, plus a numeric boost key.
Option | Description | Values | Default |
---|---|---|---|
store | If true, the value will be returned in the search result; if false, the value will not be returned in the search result. | true, false | false |
index | Whether the data is indexed. If set to false, the data cannot be used for searches, but it can still be retrieved from the index if store is set to true. | true, false | true |
facet | Whether to enable faceting. | true, false | false |
boost | To increase the relevance of the data being indexed in search results, supply a number greater than 1.0. Relevance will be adjusted by the given factor. | Any positive floating point number. | 1.0 (no boosting) |
Calling the index function with both store and index set to false has no effect.
Attempting to index using a data field that does not exist will fail. To avoid this problem, use an appropriate guard clause.
The index function requires the value of the data field to index as its second parameter. However, if that data field does not exist for the document, an error occurs. The solution is to use a ‘guard clause’ that checks if the field exists, and contains the expected type of data, before attempting to create the corresponding index.
For example, you might use the JavaScript typeof operator to determine the type of the data field:
if (typeof(doc.min_length) === 'number') {
  index("min_length", doc.min_length, {"store": "yes"});
}
If the field exists and has the expected type, the correct type name is returned, so the guard clause test succeeds and it is safe to use the index function. If the field does not exist, typeof returns "undefined" rather than the expected type, so the index function is never called for that field.
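To see the guard clause at work outside Cloudant, one option is to run the index function locally against a stand-in for index(). The mock below only records calls; it is an assumption made for this sketch and not part of Cloudant, and the sample documents are invented.

```javascript
// Local stand-in for Cloudant's built-in index(); it only records the calls.
// This mock is an assumption for local testing, not part of Cloudant.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options });
}

// Index function with a guard clause, as described above.
function indexMinLength(doc) {
  if (typeof doc.min_length === "number") {
    index("min_length", doc.min_length, { store: true });
  }
}

indexMinLength({ _id: "llama", min_length: 1.7 });  // indexed
indexMinLength({ _id: "unknown" });                 // skipped: field is missing
indexMinLength({ _id: "bad", min_length: "1.7" });  // skipped: wrong type

console.log(indexed.length); // 1
```

Only the first document reaches index(), because the other two fail the typeof check.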
Review Analyzers
Analyzers define how to extract index terms from text, which you might need to do if your application needs to index Chinese, for example. Here's the list of generic analyzers supported by Cloudant search. See further down for language-specific analyzers.
Analyzer | Description |
---|---|
standard | This is the default analyzer and implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
email | Like standard but tries harder to match an email address as a complete token. |
keyword | Input is not tokenized at all. |
simple | Divides text at non-letters. |
whitespace | Divides text at whitespace boundaries. |
classic | The standard Lucene analyzer circa release 3.1. You'll know if you need it. |
You can choose which analyzer is used by your index function by changing the index definition in the design document.
Defining an analyzer
"indexes": {
  "mysearch": {
    "analyzer": "whitespace",
    "index": "function(doc){ ... }"
  }
}
Note: Changing the analyzer causes the index to be rebuilt. (Also note that queries against a given index are run with the same analyzer that is defined for that index.)
Language-specific Analyzers
We provide a large number of analyzers for specific languages. These analyzers will omit very common words in the specific language, as these tend to make poor search queries and cause considerable index bloat. Many of these also perform stemming, where common word prefixes or suffixes are removed.
- arabic
- armenian
- basque
- bulgarian
- brazilian
- catalan
- cjk (Chinese, Japanese, Korean)
- chinese (smartcn)
- czech
- danish
- dutch
- english
- finnish
- french
- german
- greek
- galician
- hindi
- hungarian
- indonesian
- irish
- italian
- japanese (kuromoji)
- latvian
- norwegian
- persian
- polish (stempel)
- portuguese
- romanian
- russian
- spanish
- swedish
- thai
- turkish
Per-Field Analyzer
Use the perfield analyzer to configure different analyzers for different field names.
Per-field analysis
"indexes": {
  "mysearch": {
    "analyzer": {
      "name": "perfield",
      "default": "english",
      "fields": {
        "spanish": "spanish",
        "german": "german"
      }
    },
    "index": "function(doc){ ... }"
  }
}
Stop words
You may want to define a set of words that do not get indexed. These are called stop words. You define stop words in the design document by turning the analyzer string into an object:
A simple stop words example
"indexes": {
  "mysearch": {
    "analyzer": {"name": "portuguese", "stopwords": ["foo", "bar", "baz"]},
    "index": "function(doc){ ... }"
  }
}
Note that the keyword, simple and whitespace analyzers do not support stop words.
API options
q (or query): the query string. This is the query that is passed on to the search index. Search supports two data types, string and number, and the data type is detected automatically. If you need to pass a number in as a string you will need to quote it, e.g. q="12".
The search URL can optionally take limit, include_docs and stale (which have the same behavior as those in the primary and secondary indexes), as well as sort and bookmark.
Pagination and Sorting
Bookmarks allow you to efficiently skip through results you have already seen. All search results include a bookmark in their JSON response. By passing this value to the search URL via the bookmark query parameter, you will see the next page of results.
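The bookmark hand-off can be sketched as a small helper that derives the parameters for the next request from the previous response. The helper name `nextPageParams`, the sample response and its bookmark value are invented for illustration, and treating an empty rows array as the end of the results is an assumption of this sketch.

```javascript
// Given the parameters of the previous request and its JSON response,
// return the parameters for the next page, or null when no rows came back.
// Treating an empty "rows" array as the end of the results is an assumption.
function nextPageParams(params, response) {
  if (!response.rows || response.rows.length === 0) {
    return null;
  }
  return { ...params, bookmark: response.bookmark };
}

// Hypothetical first request and response; the bookmark value is made up.
const firstParams = { q: "class:bird", limit: 2 };
const page1 = { total_rows: 3, bookmark: "g1AAAA", rows: [{ id: "kookaburra" }, { id: "snipe" }] };

console.log(nextPageParams(firstParams, page1));
console.log(nextPageParams(firstParams, { total_rows: 3, bookmark: "g1AAAA", rows: [] }));
```

A caller would keep requesting pages with the returned parameters until the helper returns null.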
Search results can be sorted ascending or descending by any numeric or string field in the index. Sort order is set by the sort query parameter, which takes a JSON string or list as its parameter. If the field is a string field, you have to add <string> to the end of the field name. If you wanted to sort by age you'd query your search index with ?sort="age"; if you wanted to sort descending you'd use ?sort="-age". If you wanted to sort by name, you'd use ?sort="name<string>". Sorts can be applied to multiple fields; for instance, ?sort=["-age", "height"] would sort by age descending then height ascending.
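Because the sort value is JSON, it needs to be serialized and URL-encoded before it goes into the query string. A minimal sketch, using a hypothetical helper named `encodeSort`:

```javascript
// Encode a sort specification (a string or an array of strings)
// as a URL-safe "sort=" query parameter.
function encodeSort(sort) {
  return "sort=" + encodeURIComponent(JSON.stringify(sort));
}

console.log(encodeSort("-age"));                     // sort=%22-age%22
console.log(encodeSort(["-age", "name<string>"]));
```

Decoding the second result gives back sort=["-age","name<string>"], which matches the list form described above.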
Sorting by Relevance
The default sort order (when you don't supply a sort parameter) is relevance: the highest scoring matches are returned first. If you specify a sort order then matches are returned in that order, ignoring relevance. If you want to include the relevance ordering in your sort order you can use the special fields -<score> and <score>.
Sorting By Distance
In addition to sorting by indexed fields, you can sort by distance from a point chosen at query time. You will need to index two numeric fields (representing the longitude and latitude of whatever you're indexing);
function(doc) { index("mylon", doc.longitude); index("mylat", doc.latitude); }
You can then query using the special <distance...> sort field, which takes 5 parameters:
- longitude field name: The name of your longitude field ("mylon" in this example)
- latitude field name: The name of your latitude field ("mylat" in this example)
- longitude of origin: The longitude of the place you want to sort by distance from
- latitude of origin: The latitude of the place you want to sort by distance from
- units: The units to use ("km" or "mi" for kilometers and miles, respectively). The distance itself is returned in the order field
An example query to make this clear:
?sort="<distance,mylon,mylat,-0.14479689999996026,51.4964609,mi>"
You can combine sorting by distance with a bounding box query to perform simple geo operations.
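As an illustration of combining the two, the sketch below builds a distance sort field and a bounding box query from two numeric range clauses. The helper names and the coordinates are hypothetical; the field names mylon and mylat come from the index function above.

```javascript
// Build the special <distance,...> sort field described above.
function distanceSort(lonField, latField, lon, lat, units) {
  return `<distance,${lonField},${latField},${lon},${lat},${units}>`;
}

// Build a bounding box query from two inclusive numeric range clauses.
function boundingBox(lonField, latField, west, south, east, north) {
  return `${lonField}:[${west} TO ${east}] AND ${latField}:[${south} TO ${north}]`;
}

// Hypothetical origin and box; replace with real coordinates.
const geoSort = JSON.stringify(distanceSort("mylon", "mylat", -0.1448, 51.4965, "mi"));
const geoQuery = boundingBox("mylon", "mylat", -1, 51, 1, 52);

console.log(`?q=${encodeURIComponent(geoQuery)}&sort=${encodeURIComponent(geoSort)}`);
```

The query restricts matches to the box, and the sort orders whatever remains by distance from the origin.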
Review Query Syntax
Search queries take the form of name:value (unless the name is omitted, in which case they hit the default field as we demonstrated in the first example, above).
Queries over multiple fields can be logically combined, and both fields and groups of clauses can be grouped with parentheses. The available logical operators are AND, +, OR, NOT and -, and they are case sensitive. Range queries can run over strings or numbers.
If you want a fuzzy search you can run a query with ~ to find terms like the search term; for instance, look~ will find the terms book and took.
You can also increase the importance of a search term by using the boost character ^. This makes matches containing the term more relevant, e.g. cloudant "data layer"^4 will make results containing "data layer" 4 times more relevant. The default boost value is 1. Boost values must be positive, but can be less than 1 (e.g. 0.5 to reduce importance).
Wildcard searches are supported, for both single (?) and multiple (*) character searches. dat? would match date and data; dat* would match date, data, database, dates etc. Wildcards must come after a search term; you cannot do a query like *base.
Result sets from searches are limited to 200 rows, and return 25 rows by default. The number of rows returned can be changed via the limit parameter.
The response contains a bookmark. If the bookmark is passed back as a URL parameter you'll skip through the rows you've already seen and get the next set of results.
The following characters require escaping if you want to search on them;
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
Escape these with a preceding backslash character.
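A small helper can apply this escaping to user-supplied terms before they are placed in a query. This is a sketch with a hypothetical function name; as a conservative assumption it escapes every & and | individually rather than only the && and || pairs.

```javascript
// Escape Lucene query syntax characters with a preceding backslash.
// Assumption: each & and | is escaped individually, which also covers && and ||.
function escapeQueryTerm(term) {
  return term.replace(/[+\-!(){}\[\]^"~*?:\\\/&|]/g, "\\$&");
}

console.log(escapeQueryTerm("(1+1):2"));    // \(1\+1\)\:2
console.log(escapeQueryTerm("data/layer")); // data\/layer
```

Only the raw term should be escaped, not the whole query, otherwise intended operators such as AND clauses with field:value pairs would be escaped too.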
The animaldb database contains a design document that, amongst other things, defines a search index over the animal name, diet, minimum length, Latin name and class.
function(doc){
  index("default", doc._id);
  if (doc.min_length){
    index("min_length", doc.min_length, {"store": "yes"});
  }
  if (doc.diet){
    index("diet", doc.diet, {"store": "yes"});
  }
  if (doc.latin_name){
    index("latin_name", doc.latin_name, {"store": "yes"});
  }
  if (doc.class){
    index("class", doc.class, {"store": "yes"});
  }
}
With this index you can run any of these queries.
Desired result | Query |
---|---|
Birds | class:bird |
Animals that begin with the letter "l" | l* |
Carnivorous birds | class:bird AND diet:carnivore |
Herbivores that start with letter "l" | l* AND diet:herbivore |
Medium-sized herbivores | min_length:[1 TO 3] AND diet:herbivore |
Herbivores that are 2m long or less | diet:herbivore AND min_length:[-Infinity TO 2] |
Mammals that are at least 1.5m long | class:mammal AND min_length:[1.5 TO Infinity] |
Find "Meles meles" | latin_name:"Meles meles" |
Mammals that are herbivores or omnivores | diet:(herbivore OR omnivore) AND class:mammal |
Query: https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=class:bird
Grouping Results
In addition to basic searching, you can also group results by common values of a chosen field using the group_field parameter. For full details, see Docs.
Faceted Search
Cloudant Search also supports faceted searching, which allows you to discover aggregate information about all your matches quickly and easily. You can even match all documents (using the special ?q=*:* query syntax) and use the returned facets to refine your query.
Indexing a facet is straightforward, and facet values can be strings or numbers:
function(doc) { index("type", doc.type, {"facet": true}); index("price", doc.price, {"facet": true}); }
Once indexed, you can find out how many documents you have of any string facet with the counts= parameter, in addition to any query string you like. Example output for ?q=*:*&counts=["type"] follows:
{"total_rows":100000, "bookmark":"g...", "rows":[...], "counts":{"type":{"sofa":10.0, "chair":100.0}} }
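A counts request and its result could be handled along these lines. This is a sketch only: `countsQuery` is a hypothetical helper, and the sample response is copied from the example output above with its rows left empty.

```javascript
// Build the query string for a counts facet request.
function countsQuery(q, facetFields) {
  return `?q=${encodeURIComponent(q)}&counts=${encodeURIComponent(JSON.stringify(facetFields))}`;
}

// Sample response taken from the example output above (rows left empty).
const facetResponse = {
  total_rows: 100000,
  bookmark: "g...",
  rows: [],
  counts: { type: { sofa: 10.0, chair: 100.0 } }
};

console.log(decodeURIComponent(countsQuery("*:*", ["type"]))); // ?q=*:*&counts=["type"]

// List each facet value with its document count.
for (const [value, count] of Object.entries(facetResponse.counts.type)) {
  console.log(`${value}: ${count}`);
}
```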
You can also perform range facet queries on numeric facets using the ranges= parameter. For example:
?q=*:*&ranges={"price":{"cheap":"[0 TO 100]","expensive":"{100 TO Infinity}"}}
The range facet syntax reuses the standard Lucene syntax for ranges (inclusive range queries are denoted by square brackets, exclusive range queries are denoted by curly brackets).
This will return output like;
"ranges":{"price":{"cheap":101.0,"expensive":99899.0}}
Using POST
Some queries can get very long or it can be difficult to URL encode your query correctly. In these cases you can use a POST instead;
{"query":"*:*", "limit":100}
The request body accepts the same parameters you already know, but you no longer have to worry about URL length limits or URL escaping.
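For instance, a POST request could be prepared as follows. This is a sketch, not a definitive client: `postSearchRequest` is a hypothetical helper, `myaccount` is a placeholder account name, and authentication is left out.

```javascript
// Build the URL and fetch() options for a POST search request.
// "myaccount" is a placeholder account name; authentication is omitted here.
function postSearchRequest(account, db, ddoc, indexName, body) {
  return {
    url: `https://${account}.cloudant.com/${db}/_design/${ddoc}/_search/${indexName}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body) // no URL escaping needed for the query
    }
  };
}

const searchRequest = postSearchRequest("myaccount", "animaldb", "views101", "animals", {
  query: "*:*",
  limit: 100
});

console.log(searchRequest.options.body); // {"query":"*:*","limit":100}
// To send it: fetch(searchRequest.url, searchRequest.options).then((res) => res.json())
```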
Example applications
If you'd like to replicate the example databases described below into your account you are welcome to do so, but they both use sizable datasets and will use up a significant number of Cloudant units.
Full text indexing is what Lucene is built for, and Cloudant search is no different. In this Lobby Search example we've taken a public lobbyist disclosure dataset from the US Senate. The dataset consists of 757,123 individual documents. The uncompressed XML documents are 2.5 GB on disk, and the corresponding Cloudant database is only 1.3 GB.
Geo indexing is possible with Cloudant search. By combining location awareness with other queries you can build applications that find what a user wants, where a user is. In this Simple Geo Places example we've taken the Simple Geo "places of interest" data set of over 20 million locations and combined it with searches over other values (e.g. find restaurants near the office). A simple geo-indexer couldn't do these "refined searches" because they require additional dimensions in the query.
Find more videos and tutorials in the Learning Center.