Today, we’re giving you a crash course on Elasticsearch data types. Built on Apache Lucene, Elasticsearch is an open-source solution that provides developers with a search and analytics engine capable of working with a variety of data types.
Elasticsearch defines several data types: objects, text, numbers, geospatial, structured, aggregate, and document ranking.
If you don’t understand what those terms encompass, you’ve come to the right place. In this article, we’ll break down what some of these types of data are and give you some examples so you can build on your foundational knowledge of Elasticsearch. Once you have this foundation, you’ll be able to build better search and analytics algorithms for all your projects.
Structured vs. Unstructured Data in Elasticsearch
To really understand the different data types in Elasticsearch, you need to grasp the differences between structured and unstructured data. Structured data is really quite simple—it’s data that can be organized in a specific way using a certain format.
One great example of this that many developers can understand is a relational database. If you have a database set up to keep track of employee data, as a simple example, you might have an ID field that stores an integer, a first name field that stores a string, and a last name field that stores another string. This is structured because it has to follow the rules specified by your relational database.
Unstructured data, on the other hand, doesn’t follow many rules. Let’s say you’re working at a company with a document server that stores PDFs or text files of a variety of information. Since it’s not all the same type of document, these are considered unstructured. However, you still need to be able to search through this data.
Elasticsearch comes in and allows your company to index this unstructured data in a structured format, making searches through this server much easier, faster, and more efficient.
It does this by analyzing the content and storing the resulting terms in an inverted index, a structure that maps each term to the documents that contain it, essentially turning your server into its own search engine. Each field is mapped to one of Elasticsearch’s data types, including the textual, numerical, and geospatial types you’ll learn more about throughout the rest of this article.
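To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It is a toy model, not how Elasticsearch or Lucene is implemented: the `analyze`, `build_inverted_index`, and `search` helpers and the sample documents are all made up for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical "document server" contents
docs = {
    1: "Quarterly sales report for the retail team",
    2: "Employee handbook: vacation and sales policies",
    3: "Meeting notes from the retail planning session",
}
index = build_inverted_index(docs)
print(sorted(search(index, "sales")))         # documents mentioning "sales"
print(sorted(search(index, "retail sales")))  # documents mentioning both terms
```

The point of the structure is that a search looks up terms directly instead of scanning every document.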
Textual Data Types in Elasticsearch
|text||standard field for full text|
|match_only_text||space-optimized version of text|
|annotated-text||text that contains special markup|
|completion||used for auto-complete features|
|search_as_you_type||offers as-you-type completion|
|token_count||counts the number of tokens in a text|
One of the most commonly used textual data types in Elasticsearch is the text type, because it supports what search is most often used for: finding text. Elasticsearch passes the content of each text field through an analyzer that converts the string into a list of individual terms before indexing it. This data type might be useful for the following types of content:
- Blog posts or articles
- Email content
- Descriptions of products on a webpage
- Electronic books
Essentially, Elasticsearch can be set up to search through any number of long or short strings of text. This feature might be used as a web search engine, a find/replace feature in word processing software, or a search feature in an email client. Without this capability, it would be much harder to parse through large collections of information.
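As a rough sketch, a mapping that uses the text type and a query against it might look like the request bodies below, written as Python dictionaries and printed as JSON. The field names `title` and `body` are hypothetical, and the bodies are only printed here rather than sent to a cluster.

```python
import json

# Hypothetical index layout: "title" and "body" are made-up field names.
index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            # "english" is one of the built-in language analyzers
            "body": {"type": "text", "analyzer": "english"},
        }
    }
}

# A match query analyzes the query string before looking up its terms
search_body = {"query": {"match": {"body": "search engines"}}}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```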
The match_only_text type behaves much like text, with some differences. It exists mainly to save disk space: it gives up scoring and positional information in exchange for a smaller index. However, it does come with some downsides:
- Term-level queries run as fast as on a text field, if not faster, but queries that need positions, such as match_phrase, often run slower because Elasticsearch has to look at the source document to verify a match.
- The analysis cannot be configured; text is always analyzed with the default analyzer.
- There is no support for span queries.
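Log messages are a typical candidate for this type, since they are usually filtered rather than ranked. A minimal sketch of such a mapping, assuming an Elasticsearch version that includes match_only_text and using hypothetical field names:

```python
import json

# Hypothetical log index: "message" is searched but rarely scored,
# so it is mapped as match_only_text to save space.
logs_index_body = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "match_only_text"},
        }
    }
}

print(json.dumps(logs_index_body, indent=2))
```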
Numerical Data Types in Elasticsearch
|long||signed 64-bit integer|
|integer||signed 32-bit integer|
|short||signed 16-bit integer|
|byte||signed 8-bit integer|
|double||double-precision 64-bit floating-point number|
|float||single-precision 32-bit floating-point number|
|half_float||half-precision 16-bit floating-point number|
|scaled_float||floating-point number backed by a long and scaled by a fixed double scaling factor|
|unsigned_long||unsigned 64-bit integer|
Even if you’re a new developer, you should recognize many of these data types because they are fairly common in many programming languages. Which numerical data type you plan to use will depend on your use case. As an example, here is a small table that will give you an idea of the upper and lower limits of each integer type:
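|Numerical Data Type||Lower Limit||Upper Limit|
|long||-2^63||2^63 - 1|
|integer||-2^31||2^31 - 1|
|short||-32,768||32,767|
|byte||-128||127|
|unsigned_long||0||2^64 - 1|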
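The limits of each integer type follow directly from its bit width, which the short Python sketch below computes. The `smallest_type_for` helper is a hypothetical convenience for picking the narrowest type that fits a known value range; it is not part of Elasticsearch.

```python
# Bit widths of the signed integer field types
SIGNED_BITS = {"byte": 8, "short": 16, "integer": 32, "long": 64}

def signed_limits(bits):
    """Lower and upper limit of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def smallest_type_for(lo, hi):
    """Pick the narrowest signed type whose range covers [lo, hi]."""
    for name, bits in sorted(SIGNED_BITS.items(), key=lambda kv: kv[1]):
        low, high = signed_limits(bits)
        if low <= lo and hi <= high:
            return name
    return None  # does not fit in a signed 64-bit long

for name, bits in SIGNED_BITS.items():
    print(name, *signed_limits(bits))
print("unsigned_long", 0, 2 ** 64 - 1)

# A quantity between 0 and 50,000 is too large for short
print(smallest_type_for(0, 50_000))
```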
However, one of those numerical data types might need a bit more explanation. Let’s talk about the scaled_float.
Let’s say we have a field that’s mapped with this data type. There are two important things to note with this data type:
- It will be stored as an integer.
- It will also have a scaling factor.
In this scenario, if the field’s value is 5.6 and the scaling factor is 10, it would be stored as 56. In other words, Elasticsearch multiplies the value by 10 and rounds it to the nearest integer at index time, and the stored number is divided by 10 to get the value back. On the surface, it might seem counterintuitive to store numbers this way. However, integers compress much better than floating-point numbers, so storing floats this way mainly saves disk space.
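The arithmetic behind scaled_float can be sketched in a few lines of Python. This only illustrates the multiply-and-round idea: the function names and the `price` field are hypothetical, and Python’s `round()` stands in for Elasticsearch’s rounding, which may differ on exact ties.

```python
# A mapping fragment for a hypothetical "price" field
price_mapping = {"price": {"type": "scaled_float", "scaling_factor": 100}}

def encode_scaled_float(value, scaling_factor):
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)

def decode_scaled_float(stored, scaling_factor):
    """Divide the stored integer by the scaling factor."""
    return stored / scaling_factor

stored = encode_scaled_float(5.6, 10)
print(stored)                           # the integer that gets stored
print(decode_scaled_float(stored, 10))  # the value read back

# Precision finer than the scaling factor is lost:
# with a factor of 10, 5.67 comes back as 5.7.
print(decode_scaled_float(encode_scaled_float(5.67, 10), 10))
```

A larger scaling factor keeps more decimal places at the cost of larger stored integers, so it is worth choosing the smallest factor that meets your accuracy needs.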
Geospatial Data Types in Elasticsearch
|geo_point||latitude and longitude points|
|geo_shape||complex shapes (ex: polygon)|
|point||arbitrary cartesian points|
|shape||arbitrary cartesian geometries|
Geospatial data can play an integral role in how well your search serves your end users. One of the primary data types you’ll use in this scenario is geo_point. Here are some ways those latitude and longitude points can be useful:
- Finding geopoints within a specific distance of a location
- Aggregating documents by their distance or geographic grids
- Weaving distance into a search’s relevance
- Sorting documents by their distance
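The first and last items in that list can be sketched as follows. The request body combines a geo_distance filter with a distance sort, and the small Python model underneath shows what "within a distance" and "sorted by distance" mean. The `location` field, the sample places, and the helper functions are all hypothetical, and the haversine formula with a mean Earth radius is an approximation chosen for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

origin = {"lat": 40.7128, "lon": -74.0060}

# Request body: filter by distance, then sort nearest first
geo_query = {
    "query": {"bool": {"filter": {
        "geo_distance": {"distance": "10km", "location": origin}
    }}},
    "sort": [{"_geo_distance": {"location": origin, "order": "asc", "unit": "km"}}],
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(places, origin, radius_km):
    """Local model of the query: keep places in range, nearest first."""
    scored = [
        (haversine_km(origin["lat"], origin["lon"],
                      p["location"]["lat"], p["location"]["lon"]), p["name"])
        for p in places
    ]
    return [name for dist, name in sorted(scored) if dist <= radius_km]

# Hypothetical documents with a geo_point-style "location" field
places = [
    {"name": "C", "location": {"lat": 34.0522, "lon": -118.2437}},
    {"name": "B", "location": {"lat": 40.7306, "lon": -73.9352}},
    {"name": "A", "location": {"lat": 40.7128, "lon": -74.0060}},
]
print(nearby(places, origin, 10))
```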
However, there might be situations when it makes more sense to index and search data by its shape rather than by individual points. A geo_shape field can hold geometries such as points, lines, polygons, and rectangles (envelopes). Elasticsearch indexes these by decomposing each shape into a mesh of triangles and indexing every triangle as a multidimensional point, which allows for an accurate spatial resolution of the shape.
The point data type allows you to index x, y pairs in a two-dimensional coordinate system. There are five ways Elasticsearch facilitates this indexing:
- It can be expressed as a GeoJSON object, having type and coordinates keys.
- It can be expressed as Well-Known Text (WKT), formatted this way: “POINT(x y)”.
- It can be expressed as an object, having x and y keys.
- It can be expressed as an array, formatted this way: [ x, y ].
- It can be expressed as a string, formatted this way: “x, y”.
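The five representations above describe the same pair of coordinates. The sketch below writes each one out as a Python value and normalizes it to an `(x, y)` tuple. The `parse_point` helper is hypothetical and only mimics the idea for illustration; it is not Elasticsearch’s own parser and does no validation beyond what is shown.

```python
import re

def parse_point(value):
    """Normalize any of the five point representations to an (x, y) tuple."""
    if isinstance(value, dict):
        if "coordinates" in value:          # object with type and coordinates keys
            x, y = value["coordinates"]
        else:                               # object with x and y keys
            x, y = value["x"], value["y"]
    elif isinstance(value, (list, tuple)):  # [x, y] array
        x, y = value
    elif isinstance(value, str):
        wkt = re.fullmatch(r"\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*",
                           value, re.IGNORECASE)
        if wkt:                             # "POINT(x y)" string
            x, y = wkt.groups()
        else:                               # "x, y" string
            x, y = value.split(",")
    else:
        raise TypeError(f"unsupported point value: {value!r}")
    return float(x), float(y)

examples = [
    {"type": "Point", "coordinates": [-71.34, 41.12]},
    "POINT (-71.34 41.12)",
    {"x": -71.34, "y": 41.12},
    [-71.34, 41.12],
    "-71.34, 41.12",
]
for example in examples:
    print(parse_point(example))
```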
Similar to the point data type, Elasticsearch’s shape type lets you index and search shapes (such as rectangles or polygons) using coordinates in a two-dimensional system. Documents that use the shape mapping can be queried with shape queries, which find the documents whose shapes relate to a given shape in a specified way, for example by intersecting it or falling within it.
Practical Applications of Elasticsearch Data Types
Now that you know the difference between the main Elasticsearch data types, how can you apply them to your actual projects? All of this theory is useless if you don’t have anything to apply it to. Let’s break down the best practical applications for each data type and instances where you might see them deployed.
1. Textual Data Types
Digital Publishing: Elasticsearch’s textual data types come into play in the world of digital publishing. Publishers use Elasticsearch to create full-text search capabilities on their websites. This allows users to quickly find articles, blog posts, or other pieces of content that are relevant to their search terms.
Customer Support: Textual data types also prove useful in customer support scenarios. Businesses can use Elasticsearch to index and search through a vast database of support tickets, enabling quicker resolution of customer issues.
2. Numerical Data Types
E-commerce: Elasticsearch’s numerical data types find applications in e-commerce platforms. These data types can be used for efficient sorting and filtering of products based on numerical attributes such as price, ratings, or quantities available.
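As a rough illustration of that kind of filtering and sorting, the sketch below builds a request body with range filters and a sort. The `price` and `rating` field names and the `product_filter_query` helper are hypothetical, and the body is only printed rather than sent to a cluster.

```python
import json

def product_filter_query(min_price, max_price, min_rating):
    """Build a search body that filters on numeric fields and sorts by price."""
    return {
        "query": {"bool": {"filter": [
            {"range": {"price": {"gte": min_price, "lte": max_price}}},
            {"range": {"rating": {"gte": min_rating}}},
        ]}},
        "sort": [{"price": {"order": "asc"}}],
    }

body = product_filter_query(10, 50, 4)
print(json.dumps(body, indent=2))
```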
Stock Market Analysis: The financial industry relies heavily on numerical data types in Elasticsearch. Stock market data can be indexed in Elasticsearch to perform real-time analytics, pattern detection, and forecasting.
3. Geospatial Data Types
Food Delivery Services: Geospatial data types play an important role in food delivery and ride-hailing services. They use Elasticsearch to quickly match customers with the closest drivers or restaurants within a specified radius.
Real Estate Platforms: Real estate platforms also benefit from Elasticsearch’s geospatial data types. Indexing properties with their geolocation allows users of the platform to search for properties within a certain geographical range or in a specific neighborhood.
This guide only scratches the surface of Elasticsearch’s capabilities. While it may seem complex at first, once you get the hang of it, Elasticsearch’s usefulness really becomes apparent. Let’s jump into some frequently asked questions.