Lucene Searching

The Lucene search engine is an alternate keyword search system that is typically implemented for large databases. Customers with Lucene enabled will see “Lucene” in the top right of the Advanced Name Search window. All other PCRecruiter configurations use the standard PCRecruiter keyword search.

 

What are the differences between Keyword Version 2 and Lucene?

Search Result Screen

  • “Greener is Greater”: Search results have been changed from percentages to a simple green box. The color saturation of the green in the box will reflect the Lucene score, with a darker green indicating a better match.
  • Sorting: Searches conducted with keywords can be sorted in Lucene.
  • Keyword Searching: When using the AND operator between keywords the keywords must all exist in the same keyword section.
    • Example Keyword Search: Sales AND Manager AND Product

This search requires all three keywords to appear in the same section (for example, NOTES) to return a record. If ‘Sales’ appears only in NOTES and ‘Manager’ appears only in SUMMARY, the record will not be returned.

Highlighting:

  • Matching word results returned from a search will be highlighted.
  • All matching keywords returned by Lucene will be highlighted. This means searching for the phrase “Sales Manager” will cause the words Sales and Manager to be independently highlighted in the section as long as the phrase was found.
  • The attachment in which the search term is found will be highlighted on the attachments page, but NOT when the document is opened.

 

  1. Advantages of Lucene:
  • Sortable Keyword Search Result Columns
  • Matching Word Counts in Keyword Results
  • Searching Attachments
  • Notes and Keywords Searched Separately
  1. Searches that use Lucene:
    • When a keyword is involved in the search.
    • When any field is searched with a leading wildcard (%) operator.

 

  • Example Advanced Searches which WOULD involve Lucene:

 

Predefined Fields – Title – Like – %Tester

Keywords: None Entered

Note: The leading % wildcard will cause the search to be run in the Lucene engine.

Predefined Fields – Title – Like – Tester

Keywords: Software

 

  • Example Advanced Search which WOULD NOT use Lucene:

 

Predefined Fields – Title – Like – Test

Keywords: None Entered

 

  • Example Simple Searches which WOULD involve Lucene:

 

First Name: Doug

Keywords: Sales Manager

 

  • Example Simple Searches which WOULD NOT involve Lucene:

 

First Name: Doug
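
The two routing rules above (a keyword is present, or a field value starts with the % wildcard) can be sketched as a small check. This is only an illustration of the rules as described here, not PCRecruiter's actual code; the function name `uses_lucene` and its parameters are hypothetical.

```python
def uses_lucene(field_values, keywords=""):
    """Return True if a search would be routed to Lucene under the two rules above.

    field_values: list of values typed into field criteria (hypothetical input shape).
    keywords: the text typed into the Keywords box, empty if none was entered.
    """
    # Rule 1: any keyword involvement sends the search to Lucene.
    if keywords.strip():
        return True
    # Rule 2: a leading % wildcard on any field value sends the search to Lucene.
    return any(value.strip().startswith("%") for value in field_values)


print(uses_lucene(["%Tester"]))                # Title LIKE %Tester, no keywords
print(uses_lucene(["Test"]))                   # Title LIKE Test, no keywords
print(uses_lucene(["Doug"], "Sales Manager"))  # First Name Doug plus keywords
```

The three calls mirror the example searches above: the first and third would involve Lucene, the second would not.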

  1. Lucene Wildcard Operators:
  • The % operator offers more flexibility, but will cause the search to take longer in returning values if overused (possibly leading to a timeout). Further explanations are located in the operators section.

Examples:

Keyword Search For: test%

Possible Keywords Returned: Test, Testing, Tester

  • The ? operator requires a character of some sort to be present.

Examples:

Keyword Search For: test?

Possible Keywords Returned: Tests

Possible Keywords NOT Returned: Tes

 

  1. Rollup searching limitations of Lucene:

Searches against rollup lists are limited to rollups of 50,000 records maximum. This is due to a limitation of how much data we can pass between PCRecruiter and Lucene.

  1. Keyword highlighting:

The words no longer have to be an exact match in Lucene IF wildcards are being used. Lucene returns all words that match, with a count. All of the returned terms are appended to the highlight code to allow them to be highlighted.

  • Note: The attachment that the keyword is found in will be highlighted. When opened, the keywords will not highlight as we are opening an editable copy and do not want to allow original data to be changed.
  1. Sections of PCRecruiter that are not searched using the Lucene search engine:
  • Activity Searches
  • Email Searches
  • Rollup Search Utility
  1. Sections of PCRecruiter that are searched using Lucene:
  • Name, Company and Position Records; Predefined and Custom Fields
  • Resumes
  • Attachments
  • Notes
  • Keywords
  • Summaries
  • Profiles

Searching Operators

Field Searching

  • % Multi Character Wildcard: The multiple character wildcard will allow the user to search for partial word terms without knowing the exact spelling of the word or words. Place a percent sign where the words are missing. This operator does not require a character to be located where the percent sign has been placed.
    • Examples:

Predefined Fields – Title – LIKE – Manag%r%

Possible Result: Manager

Possible Result: Managers

Predefined Fields – Title – LIKE – %Pr%d%t

Possible Result: Product

Possible Result: Predict

Possible Result: Senior Vice President

Possible Result: President

NOTE: In advanced searching, using the LIKE operator automatically appends a % wildcard to the end of the search term.

NOTE: Overuse of this operator will cause results to become too ambiguous and cause slower returning of results or timeouts depending on the search.

  • ? Single Character Wildcard: The question mark as a search operator will allow for a user to find a word or words without knowing the exact spelling. Unlike the percent wildcard, the question mark operator requires some sort of character to be present in the location of the search operator in the search term.
    • Examples:

Predefined Fields – Title – EQUAL – Test?

Possible Result: Tests

Not a Result: Test

Not a Result: Testing

Predefined Fields – Title – LIKE – %Te?t

Possible Result: Text

Possible Result: First Test

Predefined Fields – Title – LIKE – Test?

Possible Result: Tests

Possible Result: Testing

Not a Result: Test

Note: ‘Testing’ is returned in the final example because the LIKE operator causes the search engine to automatically add a ‘%’ AFTER the question mark wildcard. The second wildcard is considered once the first five characters that meet the search criteria have been found.
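
The field wildcard behavior described in this section can be approximated by translating a search term into a regular expression. This is a rough sketch for reasoning about the examples, not the engine's implementation; the helper names are hypothetical, and case-insensitive matching is an assumption.

```python
import re


def field_pattern(term, operator="LIKE"):
    """Translate a field search term into a compiled regex (illustrative only)."""
    # LIKE appends a trailing % automatically, as noted above.
    if operator.upper() == "LIKE" and not term.endswith("%"):
        term += "%"
    parts = []
    for ch in term:
        if ch == "%":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character, which must be present
        else:
            parts.append(re.escape(ch))
    # Assumption: matching ignores case.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def field_matches(term, value, operator="LIKE"):
    """Return True if the whole field value satisfies the term."""
    return field_pattern(term, operator).fullmatch(value) is not None


print(field_matches("Test?", "Testing", "EQUAL"))  # EQUAL adds no trailing %
print(field_matches("Test?", "Testing", "LIKE"))   # LIKE turns Test? into Test?%
```

Under this sketch, EQUAL with Test? accepts only five-character values such as Tests, while LIKE with Test? also accepts Testing because of the appended %.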

  • Commas: Commas can be used to separate search values.
    • Examples:

Predefined Fields – Title – LIKE – Sales, Manager

Possible Result: Sales Lead

Possible Result: Manager of Marketing

Note: This is not limited to Lucene searching and also applies to Keyword Version 2.
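
As a rough illustration of the comma rule, the sketch below splits a field value on commas and treats each piece as its own LIKE search, any one of which may match. The function names are hypothetical, and treating LIKE without wildcards as a case-insensitive "starts with" test is an assumption based on the automatically appended % described earlier.

```python
def split_values(raw):
    """Split a comma-separated search value into individual terms."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def like_any(raw, title):
    """Return True if the title starts with any of the comma-separated terms."""
    title = title.lower()
    return any(title.startswith(value.lower()) for value in split_values(raw))


print(split_values("Sales, Manager"))
print(like_any("Sales, Manager", "Manager of Marketing"))
```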

Keyword Searching

  • % Multi Character Wildcard: The multiple character wildcard will allow the user to search for partial word terms without knowing the exact spelling of the word or words. Just place a percent where the letters are missing.
    • Examples:

Keyword Search: “Pr%d%t Manager”

Possible Result: Prdt Manager

Possible Result: Product Manager

Keyword Search: “product% Manager”

Possible Result: Production Manager

Possible Result: Product Manager

Keyword Search: “%r%d%t% m%g%r”

Possible Result: Production Manager

Possible Result: Product Manager

NOTE: Overuse of this operator will cause results to become too ambiguous and cause slower returning results or timeouts depending on the overall search run.

  • ? Single Character Wildcard: The question mark as a search operator will allow for a user to find a word or words without knowing the exact spelling. Unlike the percent wildcard the question mark operator requires some sort of character to be present in the location of the search operator in the search term.
    • Examples:

Keyword Search: Te?t

Possible Result: Test

Possible Result: Text

Keyword Search: Test?

Possible Result: Tests

Not a Result: Test

NOTE: ‘Test’ is not a result because Lucene expects there to be a character in the position of the question mark.

  • Fuzzy Search Operator: The Fuzzy Search operator is available to find similar words to the keyword which is typed. This can only be placed at the END of a word. This also can only be added next to a SINGLE keyword.
    • Examples:

Keyword Search: nest~

Possible Result: test

Possible Result: best
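
Fuzzy matching of this kind is usually explained in terms of edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The sketch below uses plain Levenshtein distance with a threshold of one edit. Both the threshold and the helper names are assumptions for illustration; the threshold Lucene actually applies depends on its version and configuration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, max_edits=1):
    """Return the candidates within max_edits of the term (trailing ~ is ignored)."""
    term = term.rstrip("~").lower()
    return [c for c in candidates if edit_distance(term, c.lower()) <= max_edits]


print(fuzzy_matches("nest~", ["test", "best", "nest", "manager"]))
```

With a one-edit threshold, nest~ picks up test and best (one substitution each) as well as nest itself, but not manager.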

  • “keyword1 keyword2”~# Distance Proximity Searching: Using the distance proximity search syntax, two words can be placed in quotes followed by a tilde and a number which represents the maximum desired distance between the words. This number must be a positive whole number. The order of the words in quotes does matter, as seen in the examples below.
    • Examples:

Keyword Search: “Sales Manager”~3

Possible Result: sales consultant working towards manager position.

Possible Result: sales manager
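
One way to picture the proximity rule is to count the words that sit between the two terms. The sketch below is a simplification: it only accepts the terms in the order typed, following the note above that order matters, whereas Lucene's own distance calculation is based on position moves. The function name and signature are hypothetical.

```python
import re


def within_distance(text, first, second, max_gap):
    """Return True if `second` follows `first` with at most max_gap words between them."""
    words = re.findall(r"\w+", text.lower())
    first, second = first.lower(), second.lower()
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    # Gap = number of words strictly between the two terms; negative means wrong order.
    return any(0 <= s - f - 1 <= max_gap
               for f in first_positions
               for s in second_positions)


sentence = "sales consultant working towards manager position."
print(within_distance(sentence, "Sales", "Manager", 3))
print(within_distance(sentence, "Sales", "Manager", 2))
```

In the example sentence three words separate sales and manager, so a distance of 3 matches and a distance of 2 does not.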

  • ^ Term Boosting: A term can be given a boosted value in the scoring equation by placing the caret and a number after the search term. This number must be a positive whole number greater than zero.
    • Examples:

Keyword Search: Product^10 Manager Sales

Possible Result: Product Manager

Note: Due to Boosting, the records with ‘Product Manager’ will come back with a higher relative score than ‘Sales Manager’.
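
To see why boosting changes the relative order, consider a toy score that simply multiplies each term's occurrence count by its boost. Lucene's real scoring formula is considerably more involved, so this is only a sketch of the idea; `parse_boosted_terms` and `toy_score` are hypothetical names.

```python
def parse_boosted_terms(query):
    """Map each query term to its boost; terms without ^N default to a boost of 1."""
    terms = {}
    for token in query.split():
        word, _, boost = token.partition("^")
        terms[word.lower()] = int(boost) if boost else 1
    return terms


def toy_score(query, text):
    """Toy relevance score: sum of (boost x occurrence count). Not Lucene's formula."""
    words = text.lower().split()
    return sum(boost * words.count(term)
               for term, boost in parse_boosted_terms(query).items())


query = "Product^10 Manager Sales"
print(toy_score(query, "Product Manager"))
print(toy_score(query, "Sales Manager"))
```

Under this toy model the boosted term dominates, which is why ‘Product Manager’ ranks above ‘Sales Manager’ for this query.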

  • OR, AND, NOT Boolean Operators: Currently OR, AND and NOT are the Boolean operators supported in PCRecruiter’s Lucene keyword search implementation. These operators can be placed between words and phrases to create complex or simple searches. If no operator is placed between multiple keywords which are not in quotes (creating a phrase) the default operator OR will be used.

Note: These operators must be in ALL CAPS

 

  • AND

 

Keyword Search: Sales AND Manager will return records that have both of those keywords located in any one section of PCR (Notes, Keywords, Resumes, etc.)

NOTE: Both of these words have to appear in the same section to return results. If Sales is located only in the record’s resume and Manager is located only in the record’s NOTES, the name WILL NOT come back.
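
The same-section requirement for AND can be sketched by checking each section of a record on its own. The record layout (a dictionary of section name to text) and the function name are hypothetical, and whitespace tokenization is a simplification.

```python
def and_match(record, terms):
    """Return True if at least one section contains every term."""
    wanted = {term.lower() for term in terms}
    for text in record.values():
        # All wanted terms must be found within this single section.
        if wanted <= set(text.lower().split()):
            return True
    return False


split_record = {"RESUME": "sales lead", "NOTES": "manager of operations"}
single_section = {"NOTES": "sales manager for the product line"}
print(and_match(split_record, ["Sales", "Manager"]))
print(and_match(single_section, ["Sales", "Manager"]))
```

The first record is not returned because the two words live in different sections; the second is returned because both appear in NOTES.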

 

  • OR

 

Keyword Search: Sales OR Manager (same as typing: Sales Manager) will return records that contain either Sales or Manager anywhere in the keyword-indexed areas, such as Notes, Keywords, Resume, etc.

 

  • NOT

 

Keyword Search: Sales Product NOT Analyst will return records that contain Sales OR Product, but WILL NOT return records that also contain the word Analyst.

NOTE: The NOT operator cannot be used by itself. For example, the Keyword Search NOT “business analyst” will return no valid results when used by itself. A message will be displayed noting “Error: Cannot perform a NOT boolean search without any other criteria.”

 

  • (keyword) Boolean Grouping: Parentheses can be placed around groups of words or phrases to better relate groups of search terms to each other. This will allow users to create possible subsets of required search terms as seen in the example below.
    • Example:

(“sales manager” OR “product manager”) AND analyst  

This search brings back any record that has either the phrase “sales manager” or “product manager” and also has the word Analyst in a keyword section (such as Notes, Resumes, etc.)
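
The grouped search above reads naturally as a nested boolean expression. The sketch below hard-codes that one query to show how the parentheses bind. It assumes the whole expression must be satisfied within a single section, in line with the AND rule described earlier, and it uses simple substring checks rather than real tokenized matching. The function names and record layout are hypothetical.

```python
def section_matches(text):
    """Evaluate ("sales manager" OR "product manager") AND analyst against one section."""
    text = text.lower()
    has_manager_phrase = "sales manager" in text or "product manager" in text
    return has_manager_phrase and "analyst" in text


def record_matches(record):
    """Return True if any single section satisfies the grouped expression."""
    return any(section_matches(text) for text in record.values())


print(record_matches({"NOTES": "Former product manager, now senior analyst"}))
print(record_matches({"NOTES": "sales manager", "RESUME": "analyst"}))
```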

Search Results

  • Sorting: The ability to sort is available in Lucene searches by most field names in the search results screens in a Lucene enabled database.
  • Matching Item Counts: Keyword matching item counts are included when hovering over the appropriate section in keyword search results. Lucene will return counts for fuzzy searches as well as multiple uses of the wildcard.
  • Greener is Greater: The shade of green is based on the “score” which Lucene assigns to search results, with the highest scores being brought to the top by default. This is determined by how many times the search criteria appears in the text as well as how close to the top of the searched text the result appears.

INDEXING RULES

  • We use the Standard Tokenizer in PCR’s version of Lucene. What does this mean?
    • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
    • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes internet hostnames as one token.
  • Indexing Explanations/Examples:
  • Special Rules:
  • All terms are split on the @ sign, meaning that an email address will index as two words in that field. This allows the domain section of the email address to be found without needing partial word wild cards (which is far more efficient for searching).
    • Example:

Raw Data: test@mainsequence.net

Indexed Data: test mainsequence.net
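
The @ rule can be illustrated in a couple of lines. This is a sketch of the described behavior only; the function name is hypothetical, and lowercasing the terms is an assumption.

```python
def index_email(raw):
    """Split an email address on @ so each side is indexed as its own term."""
    return [part for part in raw.lower().split("@") if part]


terms = index_email("test@mainsequence.net")
print(terms)
# The domain can now be found as a whole term, with no leading wildcard needed.
print("mainsequence.net" in terms)
```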

Character Rules:

The following general rules are followed for indexing of general characters.

  • Letters: A through Z
  • Numbers: 0 through 9
  • Periods which are not followed by whitespace.
  • Dashes when considered part of a serial number.
  • Accents are not indexed:
  • The following “Noise” or “Stop” words are not indexed:

 

  • an
  • and
  • are
  • as
  • at
  • be
  • but
  • by
  • for
  • if
  • into
  • is
  • it (lowercase)
  • not
  • of
  • such
  • that
  • the
  • their
  • then
  • there
  • these
  • they
  • this
  • to
  • was
  • with

 

What does it mean when words are not indexed?

Raw Data: business and decision

Indexed Data: business decision

  • Search Examples to find this data in Lucene (Field Search)

Predefined Fields – Title – LIKE – %Business and Decision

Predefined Fields – Title – LIKE – Business and Decision

  • Search Example to find this data in Keywords

Keywords: “Business and Decision”

Keywords: “Business Decision”
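
Both keyword searches above find the record because the stop word is dropped from the text before matching. The sketch below filters the stop words listed in this section. The function names are hypothetical, and treating only the all-lowercase form of "it" as a stop word is one reading of the "(lowercase)" remark in the list.

```python
# Stop words exactly as listed in this section.
STOP_WORDS = {
    "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "into",
    "is", "it", "not", "of", "such", "that", "the", "their", "then", "there",
    "these", "they", "this", "to", "was", "with",
}


def is_stop_word(word):
    """Return True if the word would be skipped during indexing (illustrative)."""
    if word.lower() == "it":
        # Assumption: only lowercase "it" is dropped, so "IT" stays searchable.
        return word == "it"
    return word.lower() in STOP_WORDS


def indexed_terms(text):
    """Return the words that remain after stop words are removed."""
    return [word for word in text.split() if not is_stop_word(word)]


print(indexed_terms("business and decision"))
print(indexed_terms("Business and Decision") == indexed_terms("Business Decision"))
```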

 

What do the tokenizer rules mean?

Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.

Examples:

Raw Data: Sales Operations Manager at Main Sequence Technologies, Inc. (www.pcrecruiter.net)

Indexed Data: Sales Operations Manager at Main Sequence Technologies Inc www.pcrecruiter.net

Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.

Examples:

Raw Data: ADMIN-SMITH

Indexed Data: ADMIN SMITH

Raw Data: ADMIN1-SMITH

Indexed Data: ADMIN1-SMITH
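
The hyphen rule can be sketched as a single check for digits. This mirrors the rule as stated here rather than the tokenizer's real grammar, and the function name is hypothetical.

```python
def split_hyphenated(token):
    """Split a token at hyphens unless it contains a digit (treated as a product number)."""
    if any(ch.isdigit() for ch in token):
        return [token]
    return [part for part in token.split("-") if part]


print(split_hyphenated("ADMIN-SMITH"))
print(split_hyphenated("ADMIN1-SMITH"))
```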

Recognizes internet hostnames as one token.

 

Examples:

Raw Data: Main Sequence Technologies, Inc. (www.pcrecruiter.net)

Indexed Data: Main Sequence Technologies Inc www.pcrecruiter.net
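
The punctuation and hostname rules can be approximated with one regular expression that keeps a dot only when it sits between letters or digits. This is a simplified sketch, not the Standard Tokenizer itself: it handles ASCII letters and digits only, and it ignores the hyphen and stop-word rules. The function name is hypothetical.

```python
import re

# A run of letters/digits, optionally followed by ".more" groups. A dot is kept
# only when another letter or digit follows it directly (never before whitespace).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def tokenize(text):
    """Return the tokens left after punctuation is removed (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


raw = "Main Sequence Technologies, Inc. (www.pcrecruiter.net)"
print(" ".join(tokenize(raw)))
```

The comma, the dot after Inc, and the parentheses are dropped, while www.pcrecruiter.net survives as one token because each of its dots is followed directly by a letter.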