What is Fuzzy Matching, and Why is it Important?

Key Points:

  • Fuzzy matching is when a computer program tries to correctly determine the intent behind an input (from a human user, say) despite a typo, missing letter, incorrect spelling, etc.
  • The term fuzzy is used because the “lines are blurred” relative to the strict logical rules by which a computer normally considers data.
  • Uses of fuzzy matching can be seen in Google searches, apps like Shazam, and in spell-checking.

Before we wax poetic about what leads to a need for fuzzy matching, let’s define it, shall we? The Fuzzy Matching technique in computer science (which is sometimes called Approximate String Matching) identifies text strings that are similar but not precisely the same. In other words, it determines the degree of closeness between two different strings. This degree is often determined by computing the Levenshtein distance between the two strings. The Levenshtein distance between two strings X and Y is defined as the minimum number of primitive operations necessary to convert X into Y.

Fuzzy matching is much like proofreading, involving the insertion, deletion, and substitution of characters.

The following are the three primitive string operations:

  • Insertion: The inserting of some character in between two other characters of a string, as in the transition from “bot” to “boat.”
  • Deletion: The removal of some particular character from a string, as in the transition from “boat” to “bot.”
  • Substitution: The changing of one particular character in a string to some other character while keeping the overall length of the string the same, as in the transition from “bot” to “dot.”

Formally, the Levenshtein distance between strings X and Y is defined via the following recursive piecewise function:

lev(X, Y) = |X|                             if |Y| = 0,
            |Y|                             if |X| = 0,
            lev(tail(X), tail(Y))           if X[0] = Y[0],
            1 + min(lev(tail(X), Y),
                    lev(X, tail(Y)),
                    lev(tail(X), tail(Y)))  otherwise,

where |X| is the length of (i.e., the number of characters in) X; tail(X) is the string formed by taking X and removing its first character; min refers to the minimum of its arguments; and X[n] is the nth character of string X, counting from 0. Thus, X[0] refers to the first character of X.
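This definition translates almost line-for-line into code. The following is a sketch in Python (memoization is added, since the naive recursion repeats subproblems exponentially; the function name is our own):

```python
from functools import lru_cache

def lev(x: str, y: str) -> int:
    """Levenshtein distance, following the recursive definition above."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if j == len(y):                      # |Y| = 0: delete the rest of X
            return len(x) - i
        if i == len(x):                      # |X| = 0: insert the rest of Y
            return len(y) - j
        if x[i] == y[j]:                     # first characters match
            return d(i + 1, j + 1)
        return 1 + min(d(i + 1, j),          # deletion
                       d(i, j + 1),          # insertion
                       d(i + 1, j + 1))      # substitution
    return d(0, 0)

print(lev("bot", "boat"))        # 1: one insertion
print(lev("kitten", "sitting"))  # 3
```

The indices i and j stand in for repeated calls to tail(), so no substrings are copied.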

So, Why the Need for Fuzzy Matching?

To answer that question, we have to go back a few steps. Computers have a reputation for being exact. This precision is useful and makes computers into the enormously powerful tools that we all know them to be. However, it can also lead to problems for which there are no clear “yes” or “no” answers — no solutions onto which we can pin a precise figure or assign an exact value. In such cases, some people jump to the conclusion that computers are poorly equipped for, or incapable of, deriving solutions.

But it’s important to consider just how often we entrust computers to deal with this type of problem. If we misspell a word, we trust the spell-checker — computer algorithms that correct our mistake and suggest the proper word. When companies find themselves looking to combine data from the same customer that spans several different databases, they turn to computer algorithms for help. However, these types of problems are inexact, and many words have near-identical spellings. Some words sound identical to others while having very different meanings (for example, “blew” versus “blue,” “through” versus “threw,” or “there,” “their,” and “they’re”). Discerning what was meant in each case of misspelling is not an exact science.

Similarly, if different organizational schemas divide different databases, then there’s a potentially endless list of ways by which the data can be combined. So, how can algorithms be written to enable computers to consistently accomplish that task? Additionally, translating across various human languages isn’t a simple task for a machine whose architecture is based on logic gates and binary code.

Nevertheless, it is true that computers can (and do) do such things. Their ability to reach into the realm of the imprecise and conjure up reliable solutions — mimicking what we call “common sense” — is not a magic trick, however. It ultimately rests on something known as fuzzy mathematics or fuzzy logic — special branches of logic or mathematics that, as their names imply, were designed to handle fuzzy and unclear phenomena.

Because computers are empowered to accomplish tasks that in some cases appear impossible, developments in fuzzy mathematics have had far-reaching and highly practical effects. Fuzzy matching is one such development, and it deserves particular attention for the way in which it has helped computers to organize and collate inexact information. Thanks to fuzzy matching algorithms, which can be fairly easily written in popular programming languages like Python and Java, computers can match strings together based on how close they are to one another. This simple technique is the backbone of every search engine and translation algorithm you’ve ever heard of — but its applications run far deeper.

Search engines like DuckDuckGo and Google can use fuzzy matching to figure out the intent of a search.

How Does Fuzzy Matching Work?

While the “fuzzy matching” definition may appear rather abstract, the idea behind it is relatively simple. Let’s start with some background — the first and most foundational of which is the concept of fuzzy mathematics itself.

Fuzzy Logic, Mathematics, and Set Theory

Mathematics and logic are often thought of as precise and inflexible. Over the last 50 years or so, however, a number of mathematicians, philosophers, and logicians have reflected deeply on the topic of vagueness, imprecision, borderline cases, and the often muddled nature of everyday speech and life. Essentially, to ignore “fuzziness” is to ignore a big part of reality. The result is fuzzy mathematics.

To give a simple example, the phrase “It is hot outside” has a truth value. However, it’s also true that different people have different ideas of what they consider to be “hot” weather, different abilities to tolerate such weather, and so on. Therefore, the phrase “It is hot outside” cannot, without additional qualification, be definitively declared either “true” or “false” (as classical logic would have us do).

Unlike classical logic, fuzzy logic rejects what is known as the Principle of Bivalence — the idea that every proposition p must be either “true” or “false,” and that no intermediate truth value can exist between them. Fuzzy logic is not the only form of non-classical logic that rejects this principle, but it is distinguished by the fact that it assigns every proposition p one of an infinite variety of truth values.

To put the matter more precisely, for every proposition p, fuzzy logic ascribes some value in the closed interval [0, 1] to it, where the value 0 represents absolute falsehood, the value 1 represents absolute truth, and every intermediate value represents some degree of truth that is either closer to or further away from absolute truth. This allows fuzzy logic to treat propositions like “It is hot outside” with greater nuance.

Areas of mathematics like Set Theory also have their corresponding fuzzy equivalents. In classical Set Theory, object x is either a member of some set S, or it is not. There is no talk of “grades of membership.” However, fuzzy set theory does allow for grades of membership for objects within sets. It does this by defining a fuzzy set A as a function from some universal set X to the interval [0, 1]:

A: X → [0,1]

The universal set X represents the universe of discourse and is always treated as a classical set. The value of A(x) therefore refers to the degree of x’s membership in A. Mapping the value 0 to some x in X means that x is not at all a member of A, while assigning the value 1 to it means that it is totally in A. All intermediate values have the corresponding intermediate significance.
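As a toy illustration of such a membership function, consider a fuzzy set “hot” over outdoor temperatures. The cutoff values below are arbitrary assumptions, chosen only to make the idea concrete:

```python
def hot(temp_c: float) -> float:
    """Membership grade of a temperature (in Celsius) in the fuzzy set 'hot'.
    Below 20 degrees is not hot at all; above 35 is fully hot;
    in between, membership rises linearly."""
    if temp_c <= 20:
        return 0.0
    if temp_c >= 35:
        return 1.0
    return (temp_c - 20) / 15

print(hot(15))    # 0.0: not a member at all
print(hot(40))    # 1.0: fully a member
print(hot(27.5))  # 0.5: a borderline day
```

In the article’s notation, this function is A: X → [0, 1] with X being the set of possible temperatures.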

Fuzzy Set Theory and Fuzzy Logic can be unified by the idea that every proposition involves the ascription of some predicate to an object. In classical logic, every proposition can be expressed by x is P, where x is an object and P is some predicate. Interpreted in the language of classical set theory, this means that the proposition x is P comes out as x is in the set P, or x∈P. Since classical logic and classical set theory are connected in this way, fuzzy logic and fuzzy set theory are connected.

A complete discussion of fuzzy logic or mathematics would take us far beyond the scope of this article. However, these basic ideas reframe what’s traditionally thought of as logic. Since everything that computers do has its basis in logic and mathematics, every algorithm that attempts to handle approximation and vagueness (as all fuzzy string searching does) must dip into this area.

A training stand with sensors, switches, and a circuit of logic elements, used in student training to illustrate Boolean logic. Boolean logic is a classical form of computer logic that fuzzy matching helps work around.

The Levenshtein Distance and Some Other Algorithms

The next important concept to tackle is that of a string. Put simply, a string is a sequence of written characters. These characters can be numbers, letters, or other kinds of symbols. Strings can be any length. They can be sequences that happen to correspond to words, like the string “Hello,” or they can be sequences that amount to nonsense in any human language, like “348trhekdsvqdk&%.”

Approximate string matching is the method of taking a string from a given collection of strings and matching it to another string from that collection, where the latter string is as close as possible to the former. Evaluating the operational definition of “closeness” is where the concepts from fuzzy logic typically come in.

There are a number of different fuzzy string searching algorithms out there. Some make use of fuzzy concepts directly, in the basic architecture of the algorithm itself, while others only introduce fuzzy concepts when adapting the algorithm to a particular application.

In one sense, the Levenshtein distance function (which is defined above) doesn’t employ fuzzy concepts directly, because it always produces a precise value for the distance between two strings. On the other hand, in cases where no other string is exactly identical to a given string x, a matcher built on the function nevertheless returns the closest string as output. Thus, we can think of Levenshtein-based matching as finding the string Y for which the statement x is Y is “truest” in the sense relevant to fuzzy logic — i.e. the matching finds some fuzzy set Y defined as

Y: X → [0,1],

where, for the given x∈X, Y(x) ≥ Y(z), ∀ z∈X.

Fuzzy logic concepts can appear in other areas as well. For example, spell-checking algorithms operate according to the following basic pattern. First, the spell-checker is given some set of strings E, such as the set of all valid words in a language like English. Then, when the user types out some string w that is not in E (as when an English word is misspelled), the spell-checker uses the Levenshtein function to generate another set of strings C. C is a subset of E containing all of the English words that the user might have meant to spell. C is constructed by stipulating some maximum allowable Levenshtein distance from w and populating C with all of the strings that meet that criterion. This maximum allowable distance has to be the result of some kind of stipulated fuzzy function — which is one other way in which fuzzy logic appears in connection with approximate string matching.

Then, the spell-checker calculates P(c), ∀ c∈C, which is the probability that every possible candidate word in C appears in actual use in the English language. The spell-checker will likely use some sort of machine learning algorithm to calculate this, as the only way for it to know the frequency with which various English words are used is to be fed enormous amounts of training data in the form of English text.

After this, the spell-checker uses some kind of error model to determine what correct English word the user meant by the misspelled string w. The error model can be as simple as declaring that a string of Levenshtein distance 1 from w is more likely to be what the user meant than a string of Levenshtein distance 2 from w, which is more likely to be what the user meant than a string of Levenshtein distance 3 from w, and so on. In practice, however, the error model is almost certain to be more sophisticated than this and will make use of data from actual English usage. This, again, will require the use of some kind of machine learning algorithm refined by the appropriate training data. Thanks to the error model, the spell-checker can calculate P(w|c), which is the probability that the user meant to type some English word c when he typed w.

Combining all of these things, the spell-checker will then be able to generate a few likely suggestions for what the user meant to type when he made his error. These suggestions will again be conditioned by some fuzzy function that specifies which suggestions are to be considered “likely” and which are not. If the algorithm is a good one, the suggestions will be accurate.
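The pipeline just described (candidate generation, a frequency-based stand-in for P(c), and a distance-based error model in the spirit of P(w|c)) can be sketched in Python. The word list, counts, and scoring rule are invented illustrations, not any real spell-checker’s data:

```python
def levenshtein(a: str, b: str) -> int:
    """Iterative dynamic-programming form of the Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(w, dictionary, freq, max_dist=2):
    """Build the candidate set C (within max_dist of w), then rank by
    edit distance first and corpus frequency second."""
    candidates = [(c, levenshtein(w, c)) for c in dictionary]
    candidates = [(c, d) for c, d in candidates if d <= max_dist]
    ranked = sorted(candidates, key=lambda cd: (cd[1], -freq.get(cd[0], 1)))
    return [c for c, d in ranked]

english = ["the", "then", "than", "hello", "help"]
counts = {"the": 1000, "then": 50, "than": 30, "hello": 20, "help": 40}
print(suggest("teh", english, counts))  # ['the', 'then'] (both at distance 2)
```

A production spell-checker would replace the (distance, frequency) sort key with learned probabilities, as the article notes, but the overall structure is the same.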

Other fuzzy matching algorithms are designed to have the same ultimate effect as the Levenshtein distance function but make use of fuzzy concepts in different ways. For example, the Soundex and Metaphone algorithms seek out strings that sound like other strings when spoken aloud, but which are spelled differently. Rather than analyzing recorded audio, these algorithms apply rule-based phonetic encodings to each string, mapping letters and letter combinations to codes that approximate their pronunciation. Strings that reduce to the same code are judged to sound alike, even when their spellings differ considerably.
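To make the phonetic idea concrete, here is a minimal sketch of the classic Soundex encoding in Python. This simplified form omits some edge-case rules found in full implementations (such as special handling of name prefixes):

```python
def soundex(word: str) -> str:
    """Minimal classic Soundex: keep the first letter, encode the rest
    as digits by phonetic group, collapse adjacent duplicates, pad to 4."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    # h and w are ignored entirely; vowels and y become "0" separators
    letters = [ch for ch in word.lower() if ch.isalpha() and ch not in "hw"]
    digits = [codes.get(ch, "0") for ch in letters]
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (word[0].upper() + tail + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163: they "sound alike"
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Two names that hash to the same four-character code are treated as phonetic matches, which is how genealogical databases catch spelling variants of surnames.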

This, in principle, lays out the basic facts of how all fuzzy matching works. Applications may range from checking spelling to merging databases with similar strings, to giving you the name of a song that you hum into your phone’s microphone. In all cases, fuzzy matching is the underlying principle.

How Does One Create a Fuzzy Matching Program?

Programs that fuzzy match are often written in programming languages like Java and Python. If you’re curious about how to write a program like this, tutorials can be found all over the internet. This particular video tutorial goes over the structure of string similarity algorithms and how they are implemented in Java. The particular algorithm considered there is called the cosine similarity algorithm. This algorithm depends on the fact that the similarity between any two non-zero vectors can be measured by the cosine of the angle between them. Thus, if two vectors V and W run parallel to each other — i.e. if the angle between them is 0 — they are considered maximally similar, and if they are orthogonal to one another — that is, if the angle between them is 90 degrees — they are considered maximally dissimilar. This is because cos(0°) = 1 and cos(90°) = 0. Thus, the value of cos(θ) for any angle θ tracks the degree of similarity between any two vectors that form the angle θ with one another.

This Java algorithm works by taking two strings, splitting each of them into n-gram representations, encoding the n-grams of each string into a vector, and then comparing the two strings by computing the dot product of their associated vectors. This works because the dot product of two vectors V and W, once each is normalized to unit length, is equal to the cosine of the angle between them.
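The same n-gram approach can be sketched in Python (bigrams by default; the function names are our own):

```python
from collections import Counter
from math import sqrt

def ngram_vector(s: str, n: int = 2) -> Counter:
    """Count the n-grams of s; the counts form a sparse vector."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """cos(theta) = (V . W) / (|V| |W|) over the strings' n-gram vectors."""
    va, vb = ngram_vector(a, n), ngram_vector(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("night", "night"))  # 1.0: parallel vectors
print(cosine_similarity("abcde", "vwxyz"))  # 0.0: no shared bigrams
```

Dividing by the vector norms performs the unit-length normalization, so the result always lands in [0, 1] for these non-negative count vectors.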

This other tutorial describes a Python library called thefuzz (formerly known as fuzzywuzzy), which you can install and use to help you do fuzzy matching in Python. The particular algorithm that this library uses relies on the Levenshtein distance function described above.
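Note that thefuzz is a third-party package (installed with pip); if you would rather avoid a dependency, the standard library’s difflib exposes a similar similarity ratio. The sketch below uses difflib, and the wrapper name ratio is our own:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Similarity score in [0, 1], comparable in spirit to thefuzz's
    fuzz.ratio (which scales the same idea to a 0-100 range)."""
    return SequenceMatcher(None, a, b).ratio()

print(ratio("fuzzy matching", "fuzzy matching"))  # 1.0
print(ratio("apple", "zebra"))                    # low: few shared characters
```

SequenceMatcher is based on longest matching blocks rather than the Levenshtein distance proper, so its scores can differ slightly from edit-distance-based libraries.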

The Origin and Discovery of Fuzzy Matching

Any discussion of the historical facts surrounding the emergence of fuzzy matching dips into the general history of fuzzy logic itself, so we’re only touching on the broad strokes. If you’re hungry for a more in-depth discussion of the history and development of fuzzy logic, check out the book Fuzzy Logic and Mathematics: A Historical Perspective, by Radim Belohlavek, Joseph W. Dauben, and George J. Klir. The book not only provides a wealth of interesting historical detail but also an excellent and rigorous discussion of the subject itself.


The first recorded thoughts motivating fuzzy logic were given by a man widely regarded as the father of classical logic: Aristotle. In a passage in one of his logical treatises, Aristotle recognized that future contingents — that is, statements of the form, “If I go on vacation to Paris next month, I will meet my future wife there” — present a thorny problem for any logic that only deals with concrete truth values of “true” and “false.” While it’s certainly sensible to speak of such statements as having truth values, it’s far from clear what those truth values are, as these statements refer to events that have not yet happened — and may never happen. If such statements can be said to be neither true nor false in the present, do they have some other truth-value?

In other passages, Aristotle also recognized that vagueness, borderline cases, and other problems can confound standard classical logic. However, he chose to set these issues to the side and proceed with developing classical logic.

Later, in the 14th century, the theologian, philosopher, and logician William of Ockham, in his Treatise on Predestination, God’s Foreknowledge and Future Contingents, presented a labyrinthine argument demonstrating that Aristotle’s statements about future contingents imply the existence of statements that are neither true nor false but have some third kind of truth value. Like Aristotle, however, Ockham chose not to further pursue this line of thinking.

Despite these occasional inklings of classical logic’s limitations, logic proceeded to develop in the standard way for many centuries, essentially remaining the way George Boole’s theories left it until only a few decades ago.

In 1965, the Azerbaijani and Soviet mathematician Lotfi A. Zadeh published a groundbreaking paper in which he laid out, for the very first time, all of the basic concepts of fuzzy logic and mathematics we’ve discussed, as well as a number of others. Zadeh worked on systems theory, control theory, information theory, circuit analysis, and electrical engineering throughout the 1940s and 1950s, and he eventually questioned the applicability of standard mathematics to complex systems. Inspired by the work of Ludwig von Bertalanffy, a biologist and the founder of general systems theory, Zadeh sought to devise a system of mathematics that could describe things that weren’t susceptible to precision. Though his ideas were initially met with resistance, they are much more widely accepted now.

The same year that Zadeh published his paper on Fuzzy Logic and Set Theory, Russian-Soviet mathematician and scientist Vladimir Levenshtein, while doing his own work in the fields of information theory and error-correcting codes, devised the idea of the “Levenshtein Distance” and the function to compute it.

Combining these ideas led to the birth of fuzzy matching.

Applications of Fuzzy Matching

We’ve already discussed some fuzzy matching applications, but here’s a brief synopsis along with a few more added. Fuzzy matching is used in:

  • spell-check software
  • search engines (providing best results for misspelled search queries)
  • relational database management (for combining customer records, checking for duplicate records, etc.)
  • genealogical searches
  • song-finding apps like Shazam

Shazam is a song-finding app that uses fuzzy matching to “listen” to the music playing and reveal the title and artist of the particular song.

Fuzzy Matching in the Real World

Here are a few explanations of how the abstract ideas associated with fuzzy matching have impacted the real world.

In one rather famous and explosive case, the FAA used a fuzzy string matching algorithm to detect a massive case of fraud. The FAA explored one large database containing the names of 40,000 pilots living in Northern California and compared those names to the names in a database of people receiving Social Security disability payments. Since it is against the law both to fly a plane while disabled and to fraudulently collect disability payments, the 40 pilots whose names were found in both databases were arrested as a result, and 14 of them had their licenses revoked. Without fuzzy matching, this fraud would likely have gone undetected.

Fuzzy matching is also extremely helpful in genome sequencing. If researchers need to determine the group to which a particular genome sequence belongs, they can initiate a fuzzy matching algorithm on the sequence, find the nearest other sequence or set of sequences, and then make an educated guess about what kind of organism the unknown sequence belongs to based on the result.

Businesses that wish to create single customer views (SCVs) for repeat or regular customers by combining disparate customer databases can do so thanks to fuzzy matching. This then allows them to serve their regulars more efficiently and offer more effective suggestions for other products they might want.

Up Next…

There are plenty more great articles for you to keep exploring about logic, history, and computers!

Frequently Asked Questions

How does fuzzy matching work?

Fuzzy matching works by combining certain general notions from fuzzy logic and mathematics with a special recursive piecewise function meant to calculate something called the Levenshtein distance between two given strings. Basically, the Levenshtein function determines the number of primitive operations — insertions, deletions, and substitutions — required to turn one string into another. Fuzzy matching then takes one string and finds the other string which has the lowest Levenshtein distance from the first string of all the strings in some given set.

What is the difference between exact matching and fuzzy matching?

The difference between an exact matching and a fuzzy matching should be easily deducible from what we have already discussed. Fuzzy matching is meant to take some given string x in a collection of strings X and find some other string z in X which is as close as possible to x — i.e. a fuzzy matching finds the string z which has the smallest Levenshtein distance from some given string x.

An exact matching, however, takes the string x and attempts to find a string with a Levenshtein distance of 0 from x — i.e. it attempts to find an exact copy of x. When no such copy exists, the exact matching algorithm simply returns no result.

Therefore, fuzzy matching and exact matching may follow the same basic algorithmic processes but will generate different results under different circumstances.
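The contrast can be sketched in a few lines of Python. Here the fuzzy side uses difflib.get_close_matches from the standard library, which ranks candidates by a similarity ratio rather than by the Levenshtein distance itself, but the behavior is the same in spirit; the word list is invented for illustration:

```python
from difflib import get_close_matches

words = ["apple", "angle", "ample", "maple"]
query = "appel"  # a misspelling; no exact copy exists in words

# Exact matching: only a Levenshtein distance of 0 counts, so it gives up.
exact = query if query in words else None
print(exact)  # None

# Fuzzy matching: return the nearest string instead of no result.
print(get_close_matches(query, words, n=1))  # ['apple']
```

Both lookups scan the same collection; they differ only in whether a non-zero distance is tolerated.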

What is fuzzy matching in SQL?

SQL, or Structured Query Language, is a programming language that was specially designed to handle the problems associated with relational database management. Databases are organized into a series of tables and rows, and various SQL commands can be used to either modify some given piece of data, modify its location in the database, or retrieve a given piece of data based on its location.

Fuzzy matching algorithms can be used to combine various databases — or various tables or rows within databases — if they have similar enough strings to one another. Thus, a fuzzy matching algorithm can be integrated pretty seamlessly into a series of SQL commands.
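As a sketch of this integration (the table, names, and threshold are invented for illustration, and difflib’s similarity ratio stands in for a Levenshtein-based measure), Python’s sqlite3 module lets you register a custom function and call it from a fuzzy WHERE clause:

```python
import sqlite3
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()

conn = sqlite3.connect(":memory:")
conn.create_function("SIMILARITY", 2, similarity)
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("Jon Smith",), ("John Smith",), ("Jane Doe",)])

# An exact WHERE name = 'Jhon Smith' would return nothing;
# a fuzzy threshold recovers the near-duplicate records.
rows = conn.execute(
    "SELECT name FROM customers WHERE SIMILARITY(name, ?) > 0.8",
    ("Jhon Smith",)).fetchall()
print(rows)  # [('Jon Smith',), ('John Smith',)]
```

Production databases typically use built-in extensions (e.g. trigram or phonetic modules) instead of a per-row Python callback, since the callback forces a full table scan.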

What is fuzzy search in a database?

Fuzzy search works exactly like fuzzy matching except that, instead of some string being given and then compared for similarity to all other strings, the user himself provides the string to be compared when he types in a search term. It’s because of fuzzy search algorithms that search engines are able to provide you with results even if you misspell a particular search query.

A fuzzy search in a database simply takes the string that you search for as input, compares it via the Levenshtein distance function to all other strings in a given database, and returns the closest result(s).

Is fuzzy matching good?

Fuzzy matching can certainly be a very powerful tool. When used properly, it can organize databases, correct spelling mistakes, refine search results, sequence genetic information, and even detect financial fraud. However, one should always bear in mind that all fuzzy matching algorithms — like the fuzzy logic and mathematics they are based on — are not absolutely accurate or foolproof. In the nature of things, they cannot be so. From time to time, they may return the wrong result, even if the portions of a given fuzzy algorithm that rely on machine learning have been well-fed with good training data.

While recognizing fuzzy matching’s power, you should be wary of its limitations. Thus, you may not want to use such algorithms to organize, collate, or sift through data in situations where you absolutely cannot tolerate any errors.

Is fuzzy matching related to machine learning?

Yes. Machine learning involves the crafting of algorithms that allow computers to take in certain kinds of data and thereby gradually improve their ability to perform certain specified tasks. Fuzzy matching algorithms typically rely on machine learning in at least some form or fashion whenever they are applied to English words or sentences, or to other non-random strings which have specific meanings and interpretations.

A fuzzy matching algorithm can successfully match any two strings together without any help from machine learning whenever the only thing that matters is the two strings’ Levenshtein distance from one another. In actual practice, however, this is rarely the only thing that matters.

In spell-check software, for example, a good program must make use of statistical data about how often various words are used in English in order to make the most accurate suggestions possible whenever a user misspells a word. This statistical data is fed into and assimilated by the spell-checker through a machine learning algorithm.
