Lucene Searching

This document covers advanced techniques for Lucene database searching. Lucene is an open-source database search engine which has been implemented in some PCRecruiter databases. It acts as a companion to the native SQL search functions whenever keywords or wildcards are used.

Lucene indexing offers advantages such as:

  • Sortable Keyword Search result columns: Lucene results can be sorted by their result columns, whereas PCR Keyword 2 results can only be sorted if they are ‘field search’ results with no keyword searching.
  • Color / Count Match Indicators: Lucene search results will include a green box to indicate the strength of the match. The deeper the saturation of the green, the better match the record is likely to be for your given query. Mousing over the ‘R’, ‘N’, ‘P’, or ‘K’ within the green box will show the number of times each matching term was found in that record.
  • Searching the contents of uploaded file Attachments: Lucene is able to search within the text of documents attached to records. (Keyword 2 search only searches within resumes, notes, keywords, and profiles.)
  • Searching Notes and Keywords separately: Keyword 2 searches the contents of Notes and Keywords together, whereas Lucene can search them as discrete elements.

To check your database, visit the Advanced Search and look for Keyword Version: Lucene at the upper right. If you see Keyword Version 2, this document does not apply to your database.

How is Lucene used?

Many searches in PCRecruiter use only the native SQL search functions. The direct search within SQL is more efficient and will return results more quickly than the Lucene index.

However, if your database is using the Lucene engine, direct field searches that involve no wildcards or keywords will use the native SQL engine, while those that require keyword searches or wildcard characters such as * or % will call upon the Lucene indexes.

Lucene may be used in searching: Predefined and Custom fields, Resumes, Attachments, Notes, Keywords, Summaries, Profiles.

Lucene is NOT used in: Activity search, Email search, or Rollups Search Utility.

For example, a Basic Search for Title | LIKE | Programmer would not involve Lucene, but using the wildcard to find Title | LIKE | *Programmer would. Similarly, using the ‘Keywords’ area to find resumes or notes containing the word ‘programmer’ would also involve Lucene.

When using the Advanced Search, the same rules apply – if any of your field terms include a wildcard, or you are combining a search within profiles, notes, resumes, keywords, etc. with your field searches, you’ll be using Lucene.
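The routing rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this document, not PCRecruiter code: the function name `uses_lucene` and its parameters are hypothetical, and treating `?` as a triggering wildcard alongside `*` and `%` is an assumption.

```python
# Hypothetical sketch of the routing rule; not PCRecruiter's actual code.
WILDCARDS = ("*", "%", "?")  # assumption: "?" triggers Lucene like "*" and "%"


def uses_lucene(field_terms, keywords=""):
    """Return True when a search would call on the Lucene indexes.

    field_terms: the values typed into field searches (e.g. a Title term).
    keywords:    text typed into a keyword area (resumes, notes, etc.).
    """
    has_wildcard = any(ch in term for term in field_terms for ch in WILDCARDS)
    has_keywords = bool(keywords.strip())
    return has_wildcard or has_keywords


print(uses_lucene(["Programmer"]))                 # plain field search
print(uses_lucene(["*Programmer"]))                # wildcard in a field term
print(uses_lucene(["Programmer"], "programmer"))   # field search plus keywords
```

Only wildcards that you type yourself count here; the trailing % that LIKE appends automatically does not, by itself, send a search to Lucene.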

Limitations

  • Searches against Rollup Lists are limited to rollups of 50,000 records maximum. This is due to a limitation of how much data we can pass between PCRecruiter and Lucene.
  • Changes to the Last Activity field do not trigger a record to be re-indexed. As a result, the Lucene index may be slightly older than the PCR index for this field, and searching or ordering by it in a Lucene-enabled search can produce odd results.
  • Parentheses are not indexed by Lucene. Therefore, searches for phrases that include parentheses in the search terms may not return expected results. For example, if you wish to find a record containing the keyword (SL-Man) you would omit the parentheses and search SL-Man.
  • Lucene does not contain a true EQUALS search. Depending on the search terms used, results may contain additional words after the term(s) which were searched for. For example, searching Title | = | Manager along with the keyword Sales could return titles such as “Manager of Production”.

Keyword Highlighting

When performing a keyword search (resumes, notes, profiles, etc.), Lucene enables highlighting of the relevant terms in the PCRecruiter results. Some points to be aware of when looking at highlight results:

  • The terms do not have to be an exact match in Lucene if any wildcards are in use. Matching word results returned from a ‘fuzzy search’ will be highlighted.
  • When searching for a phrase (example: “sales manager”), only records that match the phrase are returned, but within those records the terms will be highlighted both together as a phrase and individually wherever they appear.
  • All matching keywords returned by Lucene will be highlighted.
  • An Attachment that the keyword is found in will be highlighted yellow, but the highlights will not appear in the source document itself once opened.

Search Operators

An ‘operator’ is a special keyword or symbol that helps you refine and specify your search criteria, allowing you to be more precise when looking for information in your database. In Lucene, there are a variety of operators which can give you very specific and useful control.

Lucene Wildcards

First-word searching is the default behavior for field searching, meaning that if there is no leading wildcard (*, %) in the term(s) being searched for, only records with that term at the beginning of the field will be returned. For example, searching the “Title” field for “Sales” may return records containing the title “Sales Manager” or “Sales Associate”, but not those containing “Director of Sales”.

First-word searching does not apply when searching Keyword fields (resumes, notes, etc.). The term will be sought anywhere within the keyword indexed content.

Using ‘wildcard’ characters allows us to search for content that doesn’t exactly match the term, and to choose where and how that term is searched. There are three different wildcards for Lucene searches. We’ll explain them in general terms here, and how they function in specific search contexts below.

Multi-Character Wildcard (%): The multiple character wildcard allows the user to search for partial word terms without knowing the exact spelling of the word or words. Just place a percent sign where the characters are missing.

For example, searching Title | LIKE | m%ger could return ‘Manager’. Searching Title | LIKE | %pr%d%t could return “product,” “predict,” “senior vice president,” or “president.”

  • This operator does not require a character to be located where the percent sign has been placed.
  • When using the % wildcard between words, the behavior varies slightly: Lucene-enabled searches will match at most one word between the search terms, while non-Lucene searches will match anything between them.
  • In Advanced Search, using the LIKE operator will automatically append a % wildcard to your search.
  • Overuse of this operator can make a search too ambiguous, resulting in slower results or timeouts.

Full Word Wildcard (*): This wildcard provides the ability to find a search term anywhere in a field while limiting the overhead of the leading % wildcard. This operator is ONLY effective when placed before the search term. This operator should be used when attempting to find a search term anywhere in the field when the spelling at the beginning of the word is known.

For example, searching Title | LIKE | *Manager could return titles like “Sales Manager,” “Product Manager,” or “Production Manager”.

Single Character Wildcard (?): Using the question mark as a search operator allows a user to find a word or words without knowing the exact spelling. Unlike the percent (%) wildcard, the question mark operator requires a character to be present at the operator's position in the search term.

For example, searching Title | LIKE | %Te?t will return matches for “text” and “test”, whereas Title | LIKE | Test? would return “tests” or “testing” but not “test”, because there is no additional character where the ? wildcard is seeking one. The term “testing” would be a match here because using LIKE automatically assumes a % at the end of the term, which would be accounted for only after the first five characters meeting the search criteria were found.
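One way to build intuition for the three wildcards is to translate a LIKE term into a regular expression. The Python sketch below is an approximation of the behavior described above, not how PCRecruiter or Lucene actually evaluates a search. The helper names are hypothetical. It reads a leading * as "the term may start at the beginning of any word", and it ignores the one-word limit that applies to % between words in Lucene-enabled searches.

```python
import re


def like_to_regex(term):
    """Translate a LIKE search term into an approximate regular expression."""
    prefix = ""
    if term.startswith("*"):
        # Leading *: the term may begin at the start of any word in the field.
        prefix = r"(?:.*\b)?"
        term = term[1:]
    parts = []
    for ch in term + "%":  # LIKE automatically appends a trailing %
        if ch == "%":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile(prefix + "".join(parts), re.IGNORECASE)


def like_matches(term, value):
    """True when the whole field value matches the term (first-word default)."""
    return like_to_regex(term).fullmatch(value) is not None


for term, value in [
    ("Sales", "Sales Manager"),
    ("Sales", "Director of Sales"),
    ("*Manager", "Product Manager"),
    ("%Te?t", "text"),
    ("Test?", "test"),
    ("Test?", "testing"),
]:
    print(term, "|", value, "|", like_matches(term, value))
```

Because the pattern must match from the start of the field unless a leading wildcard is present, the sketch also reproduces the first-word behavior: “Sales” matches “Sales Manager” but not “Director of Sales”.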

Commas (,): Commas can be used to separate search terms, but only when field searching with the LIKE, NOT LIKE, or IN syntax. For example, Title | LIKE | sales, sales manager would search for either term.

Fuzzy Search Operator (~): The tilde character is the ‘fuzzy’ search operator, which will find words similar to the keyword which is entered. This can be used only at the END of a search term or next to a SINGLE keyword.

For example, searching nest~ in the keywords area might return records containing “test” or “best” or “jest”.
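Fuzzy matching in Apache Lucene is based on edit distance, meaning the number of single-character changes needed to turn one word into another. The Python sketch below computes the classic Levenshtein distance to show why “test”, “best”, and “jest” all count as close to “nest”. The function names and the `max_edits` threshold are illustrative assumptions; this document does not state the threshold PCRecruiter uses.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similar_terms(term, vocabulary, max_edits=1):
    """Return the words in vocabulary within max_edits of term (assumed threshold)."""
    return [w for w in vocabulary if edit_distance(term.lower(), w.lower()) <= max_edits]


print(similar_terms("nest", ["test", "best", "jest", "manager"]))
```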

Distance Proximity Search (“”~#): For this operation, use two words in double-quotes followed by a tilde and a positive, whole number representing the maximum desired distance between the words.

For example, searching “sales manager”~3 in the keywords area could return “sales consultant working towards manager position” because the words ‘sales’ and ‘manager’ are separated by 3 words. Another possible result could be “manager says his sales numbers were great”, because the words are fewer than 3 words apart, even though they are in a different order.
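The distance check can be pictured as counting the words that sit between the two terms, in either order. The Python sketch below follows that simplified reading of the description above. It is not Lucene's internal calculation, and the function name `within_distance` is hypothetical.

```python
import re


def within_distance(text, first, second, max_gap):
    """True when `first` and `second` occur with at most `max_gap` words
    between them, in either order (simplified reading of the rule)."""
    words = re.findall(r"\w+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in first_pos
        for j in second_pos
    )


print(within_distance("sales consultant working towards manager position",
                      "sales", "manager", 3))
print(within_distance("manager says his sales numbers were great",
                      "sales", "manager", 3))
```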

Term Boost (^#): Using a caret and positive number after a term will boost its value in the equation used for scoring its relevance.

For example, keyword searching product^10 manager sales will return records containing “product manager” and “sales manager”, but the ones containing “product” will be ranked higher than those without it.
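To see why a boosted term pushes records up the ranking, consider a toy score that counts how often each term appears and multiplies the count by the term's boost. This is a deliberately simplified illustration in Python. Lucene's real relevance formula is more involved, and the function names here are hypothetical.

```python
import re


def parse_boosted_terms(query):
    """Split a query such as 'product^10 manager sales' into {term: boost}.
    Terms without a caret get a boost of 1."""
    boosts = {}
    for raw in query.split():
        term, _, boost = raw.partition("^")
        boosts[term.lower()] = float(boost) if boost else 1.0
    return boosts


def toy_score(query, text):
    """Toy relevance: sum of (boost x number of occurrences) per term."""
    words = re.findall(r"\w+", text.lower())
    return sum(boost * words.count(term)
               for term, boost in parse_boosted_terms(query).items())


query = "product^10 manager sales"
for record in ["product manager", "sales manager"]:
    print(record, toy_score(query, record))
```

Both records match the query, but under this toy scoring the one containing “product” scores far higher, which matches the ranking behavior described above.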

OR, AND, NOT: These standard Boolean operators are supported in PCRecruiter’s Lucene keyword search implementation. These operators can be placed between words and phrases to create complex or simple searches.

Note: These operators must be in ALL CAPS.

  • AND: A keyword search for Sales AND Manager will return records that have both of those keywords. However, both terms must be located in the same section of the record (Notes, Keywords, Resumes, etc.). If “Sales” was in the resume and “Manager” was in the Notes, the record would not be considered a match.
  • OR: A keyword search for Sales OR Manager will return records with either word, anywhere in the keyword indexed areas of the record. “OR” is the default operator, so if no operator is placed between multiple keywords that are not in quotes (quotes create a phrase), an OR will be assumed (i.e. sales manager and sales OR manager are the same search).
  • NOT: This is used to exclude a result from an AND or OR search (it cannot be applied on its own). For example, searching Sales Product NOT Analyst will return results where a record has “Sales” or “Product”, but WILL NOT return those records if they also contain “Analyst”.
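The same-section requirement for AND is easy to overlook, so here is a small Python sketch of the difference between AND and OR as described above. The record is modeled as a dictionary of section names to text. That structure and the function names are assumptions made for illustration, not PCRecruiter's data model.

```python
import re


def section_words(record):
    """Map each section name to the set of lowercase words it contains."""
    return {name: set(re.findall(r"\w+", text.lower()))
            for name, text in record.items()}


def matches_and(record, terms):
    """AND: every term must appear together in at least one single section."""
    wanted = {t.lower() for t in terms}
    return any(wanted <= words for words in section_words(record).values())


def matches_or(record, terms):
    """OR: any term may appear in any section."""
    wanted = {t.lower() for t in terms}
    return any(wanted & words for words in section_words(record).values())


split_record = {"resume": "Regional sales lead", "notes": "Promoted to manager"}
same_section = {"resume": "Sales manager for ten years", "notes": ""}

print(matches_and(split_record, ["Sales", "Manager"]))
print(matches_or(split_record, ["Sales", "Manager"]))
print(matches_and(same_section, ["Sales", "Manager"]))
```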

Parentheses: Boolean grouping with parentheses around groups of words or phrases relates the terms to each other. This will allow users to create possible subsets of required search terms as seen here:

(“sales manager” OR “product manager”) AND analyst

This search returns any record that has either the phrase “sales manager” or “product manager” and also has the word “analyst” in a keyword indexed section (such as Notes, Resumes, etc.).

Additional Keyword Searching Features

KEYWORD-SPECIFIC SEARCH:

When creating or modifying a search in the Basic or Advanced Search keyword search boxes, a user can use the syntax Keywords: within the search area. When used, the results will be limited to only records having the search terms following the Keywords: delimiter in the actual Keywords section of the record.

For example, searching “Sales Manager” OR “Production Manager” KEYWORDS: MRK1 OR MRK2 can return records including the phrases “Sales Manager” OR “Production Manager” in any keyword indexed section (Resume, Notes, Keywords, Summary, Attachments, and Profiles) in PCRecruiter, but only those with “MRK1” or “MRK2” specifically in the Keywords area of the record.

KEYWORD AS ADVANCED SEARCH

When building Advanced Searches in the Advanced Search screen, the keyword indexed areas of PCRecruiter (Resume, Notes, Keywords, Summary, Attachments, and Profiles) will appear in the dropdown list along with Predefined Fields such as First Name, Company Name, or Job Title. This allows areas like Resume or Attachment to be searched for terms using the same query building tools as one might use for discrete fields.

Indexing Rules

Keyword searches work by creating an index of the ‘tokens’ within the text. There are various ways to do this token breakdown, but PCR uses the “Standard Tokenizer” to determine which elements of the document are indexed and which are not. When a word is not indexed, this means it is effectively invisible to the search engine.

Here are some things to be aware of regarding how PCR stores keywords for later searching:

  • We split words at punctuation characters and do not index the punctuation unless otherwise described below.
  • A period that’s not followed by blank space is considered part of a token. For example, Main Sequence Technologies, Inc. (www.pcrecruiter.net) would be indexed as Main Sequence Technologies Inc www.pcrecruiter.net because the dots within the web address aren’t followed by blank space, but the period after Inc is.
  • If a term contains a series of periods separated by only one letter and also ends in a period, the periods will not be indexed. For example, a.b.c. is stored for searching as abc, but aa.bb.cc. is stored with the periods intact.
  • We split words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product or serial number and is not split. For example, Sherwin-Williams would be indexed as Sherwin Williams, but Sh1-er23-win would be indexed with hyphens intact.
  • We recognize internet addresses as one token. As above, www.pcrecruiter.net would be treated as a single token, not as three.
  • Terms are split on the @ sign, meaning that an email address will index as two words in that field. This allows the domain section of the email address to be found without needing partial word wildcards, allowing far more efficient searching.
  • We index all letters and numbers, but we do not index accents (e.g. Über and Uber are treated as identical).
  • We do NOT index the following words, as they would add too much ‘noise’ to the data for effective searching: an, and, are, as, at, be, but, by, for, if, into, is, it (lowercase), not, of, such, that, the, their, then, there, these, they, this, to, was, with. For example, “driving business and decision making” in a resume would be indexed as “driving business decision making” because ‘and’ is a non-indexed word.
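The rules above can be approximated with a short Python tokenizer. This is a rough sketch for experimenting with how a piece of text might break into tokens. It is not the Standard Tokenizer itself and will differ from it in edge cases. The function names are hypothetical, and the handling of “it” (dropped only when fully lowercase, while the other non-indexed words are dropped regardless of case) is an assumption based on the “(lowercase)” note in the list. The email address in the usage lines is a made-up example.

```python
import re
import unicodedata

# Non-indexed words from the list above ("it" is handled separately below).
STOP_WORDS = {
    "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "into",
    "is", "not", "of", "such", "that", "the", "their", "then", "there",
    "these", "they", "this", "to", "was", "with",
}


def strip_accents(text):
    """Remove accent marks so that 'Über' and 'Uber' index identically."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Approximate the indexing rules: returns the list of indexed tokens."""
    tokens = []
    for chunk in strip_accents(text).split():
        # Single letters each followed by a period (a.b.c.) lose their periods.
        if re.fullmatch(r"(?:[A-Za-z]\.)+", chunk):
            tokens.append(chunk.replace(".", ""))
            continue
        # Split at punctuation (including @), keeping inner periods and hyphens.
        for piece in re.split(r"[^\w.\-]+", chunk):
            piece = piece.strip(".-")  # a period followed by a space is dropped
            if not piece:
                continue
            if "-" in piece and not any(c.isdigit() for c in piece):
                tokens.extend(p for p in piece.split("-") if p)  # no digits: split
            else:
                tokens.append(piece)  # digits present: keep hyphens intact
    # Assumption: "it" is skipped only in lowercase; other stop words in any case.
    return [t for t in tokens if t != "it" and t.lower() not in STOP_WORDS]


print(tokenize("Main Sequence Technologies, Inc. (www.pcrecruiter.net)"))
print(tokenize("Sherwin-Williams Sh1-er23-win"))
print(tokenize("jane@example.com"))  # hypothetical address
print(tokenize("driving business and decision making"))
```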