Lucene Searching

Please note the following is only applicable to a Lucene enabled database.

  1. When is a search using the Lucene Engine instead of SQL?

    • When a keyword is involved in the search.
    • When any field is searched with a leading wildcard (%) operator or (*) operator.
      • Example Advanced Searches which WOULD involve Lucene:
        • Predefined Fields – Title – Like – %Tester
        • Keywords: None Entered
        • Note: The leading % wildcard will cause the search to be run in the Lucene engine.
        • Predefined Fields – Title – Like – Tester
        • Keywords: Software
      • Example Advanced Searches which WOULD involve Lucene:
        • Predefined Fields – Title – Like – *Tester
        • Keywords: None Entered
        • Note: The leading * wildcard will cause the search to be run in the Lucene engine.
        • Predefined Fields – Title – Like – Tester
        • Keywords: Software
      • Example Advanced Searches which WOULD NOT involve Lucene:
        • Predefined Fields – Title – Like – Test
        • Keywords: None Entered
      • Example Simple Searches which WOULD involve Lucene:
        • First Name: Doug
        • Keywords: Sales Manager
      • Example Simple Searches which WOULD NOT involve Lucene:
        • First Name: Doug
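
The two routing rules above can be summarized in a short sketch. This is an illustration only, not PCRecruiter's actual code: the function name `uses_lucene` and its parameters are hypothetical, and it assumes a search is supplied as a list of field search terms plus an optional keyword string.

```python
from typing import Iterable, Optional

# Leading wildcards that, per the rules above, send a field search to Lucene.
LEADING_WILDCARDS = ("%", "*")


def uses_lucene(field_terms: Iterable[str], keywords: Optional[str] = None) -> bool:
    """Return True if the search would involve Lucene under the two rules above.

    Hypothetical helper: a non-empty keyword entry, or any field term that
    starts with % or *, routes the search to Lucene; otherwise SQL is assumed.
    """
    if keywords and keywords.strip():
        return True
    return any(term.strip().startswith(LEADING_WILDCARDS) for term in field_terms)


print(uses_lucene(["%Tester"]))               # True: leading % wildcard
print(uses_lucene(["Tester"], "Software"))    # True: a keyword is involved
print(uses_lucene(["Test"]))                  # False: plain field search
```

Under these assumptions, the simple search "First Name: Doug" stays in SQL, while adding "Keywords: Sales Manager" moves it to Lucene.
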
  2. When is first-word searching enabled in Lucene?

    • First-word searching is now the default behavior for field searching in Lucene. With first-word field searching, if the term(s) being searched for have no leading wildcard (* or %), only records with that term at the beginning of the field will be returned.
    • First-word searching IS NOT ENABLED when searching keyword fields.
    • Example Advanced Search which WOULD involve first-word searching:
      • Predefined Fields – Title – Like – Manager
      • Keywords: Sales
      • Possible Results for the title field: Manager, Manager of Production
      • Note: The keyword “sales” could have appeared in any location within the keyword fields.
    • Example Advanced Searches which WOULD NOT involve first-word searching:
      • Keywords: Sales
      • Possible Results: Sales
      • Note: The keyword “sales” could have appeared anywhere within the document in which it was found.
  5. Why is Lucene not used exclusively for searching when it is enabled?

    • Efficiency of simple searches: A lookup to a simple indexed field for something like a first and last name would be far more efficient doing a quick SQL lookup than having to jump over to the Lucene index.
  6. What are the advantages of Lucene?

    • Sortable Keyword Search Result Columns
    • Matching Word Counts in Keyword Results
    • Searching Attachments
    • Notes and Keywords Searched Separately
  7. How does Lucene currently handle wildcards?

    • The % operator offers more flexibility, but will cause the search to take longer in returning values if overused (possibly leading to a timeout). This wildcard would be used when searching for partial words. Further explanations are located in the operators section of this document.
      • Examples:
        • Keyword Search For: test%
        • Possible Keywords Returned: Test, Testing, Tester
    • The ? operator requires a character of some sort to be present.
      • Examples:
        • Keyword Search For: test?
        • Possible Keywords Returned: Tests
        • Possible Keywords NOT Returned: Test
    • The * can be used in field searching in front of a word to allow for a search term to be found anywhere in a field.
      • Example:
        • Predefined — Title — Like — *manage
        • Possible Results Returned: Manager, Sales Manager, Production Management
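
The keyword behavior of % and ? described above can be approximated with regular expressions. The sketch below is an assumption-based illustration, not how Lucene evaluates wildcards internally: the helper names are hypothetical, % is treated as "zero or more word characters", ? as "exactly one word character", and matching is case-insensitive against a single word.

```python
import re


def keyword_pattern(term: str) -> "re.Pattern[str]":
    """Translate a keyword term containing % and ? into a compiled regex."""
    parts = []
    for ch in term:
        if ch == "%":
            parts.append(r"\w*")   # % : zero or more characters
        elif ch == "?":
            parts.append(r"\w")    # ? : exactly one character must be present
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def keyword_matches(term: str, word: str) -> bool:
    """Return True if the whole word satisfies the wildcard term."""
    return keyword_pattern(term).fullmatch(word) is not None


print(keyword_matches("test%", "Testing"))  # True
print(keyword_matches("test?", "Tests"))    # True
print(keyword_matches("test?", "Test"))     # False: ? requires a character
```
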
  8. What are the limitations of Lucene?

    • Searches against rollup lists are limited to rollups of 50,000 records maximum. This is due to a limitation of how much data we can pass between PCRecruiter and Lucene.
    • The Last Activity field in the records will not trigger a record to be re-indexed if it is changed. This would cause the Lucene index to be slightly older than the PCR index. Searching or Ordering by this field in a Lucene enabled search can produce odd results.
    • When running keyword searches for phrases with parentheses in the search terms (example: “(SL-Man) Sales manager”), the results may not return as expected. Remove the parentheses from the phrase search and replace them with spaces, as parentheses are not indexed.
      • Examples:
        • Incorrect Keyword Search: “(SL-Man) Sales manager”
        • Possible Results Returned: SL Sales Manager, Man Sales Manager
        • Correct Keyword Search: “SL-Man Sales manager”
        • Possible Results Returned: (SL-Man) Sales manager
    • Lucene does not contain a true EQUALS search. Depending on the search terms used, results may contain words after the term(s) which were searched for.
      • Examples:
        • Predefined Fields – Title – EQUALS – Manager
        • Keywords: Sales
        • Possible Results for the title field: Manager, Manager of Production
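
The parentheses workaround described above can be applied mechanically before a phrase is submitted. This is a minimal sketch with a hypothetical helper name; it only replaces parentheses with spaces and tidies the whitespace, as the limitation note recommends.

```python
import re


def strip_parentheses(phrase: str) -> str:
    """Replace parentheses with spaces, then collapse repeated whitespace."""
    cleaned = phrase.replace("(", " ").replace(")", " ")
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_parentheses("(SL-Man) Sales manager"))  # SL-Man Sales manager
```
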
  9. How does keyword highlighting work?

    • The words no longer have to be an exact match in Lucene IF wildcards are being used. Lucene returns all words that match with a count. All of the returned terms are appended to the highlight code to allow them to be highlighted.
    • Note: The attachment that the keyword is found in will be highlighted yellow. When opened, the keywords will not appear highlighted, as we are opening an editable copy and do not want to allow original data to be changed.
    • Note: When searching for a phrase (example: “sales manager”), records are returned that meet the criteria of the search “sales manager”, but the terms in the phrase will be highlighted together as well as individually throughout the record.
  10. Which sections of PCRecruiter ARE NOT searched using Lucene?

    • Activity Searches
    • Email Searches
    • Rollup Search Utility
  11. Which sections of PCRecruiter ARE searched using Lucene?

    • Name Records
    • Company Records
    • Position Records
    • Predefined/Custom Fields
    • Resume
    • Attachments
    • Notes
    • Keywords
    • Summary
    • Profiles
  12. What are the differences between Keyword Version 2 and Lucene?

  • Search Result Screen
    • “Redder is Better”: Search results have been changed from percentages to a simple red box. The color saturation of the red in the box will reflect the Lucene score, with a deeper red indicating a better match.
    • Sorting: Searches with keywords in them can be sorted in Lucene.
    • Keyword Searching: When using the AND operator between keywords, the keywords must all exist in the same indexed field or document, unless using the keyword filters from the “Predefined:” dropdown list (which search fields individually). This limitation applies to the search placed in the keyword search box which loads automatically at the bottom of the Advanced Search Screen.
      • Example:
      • Keyword Search: Sales AND Manager AND Product
      • This search would require all keywords to appear in NOTES to return a record. If ‘Sales’ appears only in NOTES and ‘manager’ appears only in SUMMARY, the record will not be returned.
    • Highlighting:
      • Matching word results returned from a fuzzy search will be highlighted.
      • All matching keywords returned by Lucene will be highlighted. This means searching for the phrase “Sales Manager” will cause the words Sales and Manager to be independently highlighted in the section where the words were found.
      • The attachment in which the search term is found will be highlighted on the attachments page but NOT when the document is opened.
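
The AND rule described above (all keywords must appear in the same indexed field or document) can be sketched as follows. The record layout, a dictionary of section name to text, and the function name are assumptions made for illustration; whole-word, case-insensitive matching is also assumed.

```python
from typing import Dict, List


def matches_and(record: Dict[str, str], terms: List[str]) -> bool:
    """Return True only if a single section contains every term."""
    wanted = [t.lower() for t in terms]
    for text in record.values():
        words = set(text.lower().split())
        if all(t in words for t in wanted):
            return True
    return False


split_record = {"NOTES": "sales lead", "SUMMARY": "regional manager"}
single_section = {"NOTES": "sales manager for product line"}

print(matches_and(split_record, ["Sales", "Manager"]))               # False
print(matches_and(single_section, ["Sales", "Manager", "Product"]))  # True
```

The first record is not returned because its terms are spread across NOTES and SUMMARY, which mirrors the example given for the keyword search box.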

Searching Operators

  1. Field Searching

    • * “Full” Word Wildcard: With first-word searching available in the latest version of the PCRecruiter Lucene implementation, a wildcard was needed that would not create a partial word search when placed at the front of a word. The * provides the ability to find a search term anywhere in a field while limiting the overhead of the leading % wildcard. This operator is ONLY effective when placed before the search term. It should be used to find a search term anywhere in the field when the spelling at the beginning of the word is known.
      • Examples:
        • Predefined — Title — LIKE — *Manager
        • Keywords: (None Entered)
        • Possible Results: Sales Manager, Product Manager, Production Manager
        • Predefined — Title — LIKE — *Manage
        • Keywords: (None Entered)
        • Possible Results: Manager, Sales Manager, Product Manager, Manage
    • % Multi Character Wildcard: The multiple character wildcard allows the user to search for partial word terms without knowing the exact spelling of the word or words. Place a percent sign where the letters or words are missing. This operator does not require a character to be present where the percent sign is placed. When the % wildcard is used between words, the behavior varies slightly: Lucene enabled searches will match at most one word between the search terms, while non-Lucene enabled searches will match anything between the search terms.
      • Examples:
        • Predefined Fields – Title – LIKE – m%ger
        • Keyword: Sales
        • Possible Result: Manager
        • Predefined Fields – Title – LIKE – %Pr%d%t
        • Possible Result: Product
        • Possible Result: Predict
        • Possible Result: Senior VicePresident
        • Possible Result: President
        • NOTE: In advanced searching, the LIKE operator automatically appends a % wildcard to the search term.
        • NOTE: Overuse of this operator will cause results to become too ambiguous and cause slower returning of results or timeouts depending on the search.
        • Non-Lucene enabled search:
        • Predefined Fields – Title – LIKE – Sales % Manager
        • Keyword: None Entered
        • Possible Result: Sales Manager, Sales Department Manager, Sales Department Store Manager
        • Lucene enabled search:
        • Predefined Fields – Title – LIKE – Sales % Manager
        • Keyword: Sales
        • Possible Result: Sales Manager, Sales Department Manager
        • NOTE: In the last two examples, Sales Department Store Manager does not appear in the Lucene search because there are multiple words between the search terms.
    • ? Single Character Wildcard: The question mark as a search operator will allow for a user to find a word or words without knowing the exact spelling. Unlike the percent wildcard, the question mark operator requires some sort of character to be present in the location of the search operator in the search term.
      • Examples:
        • Predefined Fields – Title – LIKE – %Te?t
        • Possible Result: Text
        • Possible Result: Test
        • Predefined Fields – Title – LIKE – Test?
        • Keyword: Sales
        • Possible Result: Tests
        • Possible Result: Testing
        • Not a Result: Test
        • Note: The search term ‘testing’ will come back in the final example because using the ‘LIKE’ syntax causes the search engine to automatically add a ‘%’ AFTER the question mark wildcard. The appended ‘%’ wildcard is evaluated after the first five characters (‘Test’ plus one more character) have been matched.
    • Commas: Commas can be used to separate search terms when field searching, but only in combination with the LIKE, NOT LIKE, or IN syntax.
      • Examples:
        • Predefined Fields – Title – LIKE – Sales, Sales Manager
        • Note: This is not limited to Lucene searching and also applies to the new advanced search engine.
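
The field-search behavior described in this section (first-word matching by default, a leading * for whole words anywhere in the field, a leading % for partial words, and the % that LIKE appends automatically) can be approximated with regular expressions. This is a simplified sketch under stated assumptions, not the actual search engine: `field_matches` is a hypothetical helper, matching is assumed to be case-insensitive, and the special "one word maximum" handling of a % placed between words is not modeled.

```python
import re


def _body_to_regex(body: str) -> str:
    """Translate % and ? inside a term; every other character is literal."""
    out = []
    for ch in body:
        if ch == "%":
            out.append(r"\S*")   # zero or more characters within a word
        elif ch == "?":
            out.append(r"\S")    # exactly one character
        else:
            out.append(re.escape(ch))
    return "".join(out)


def field_matches(term: str, value: str) -> bool:
    """Approximate a LIKE field search; the end is left open (appended %)."""
    term = term.strip()
    if term.startswith("%"):
        regex = _body_to_regex(term[1:])          # partial word, anywhere
    elif term.startswith("*"):
        regex = r"\b" + _body_to_regex(term[1:])  # start of any word in the field
    else:
        regex = "^" + _body_to_regex(term)        # first-word search
    return re.search(regex, value, re.IGNORECASE) is not None


print(field_matches("Manager", "Sales Manager"))          # False: first-word only
print(field_matches("*Manage", "Sales Manager"))          # True
print(field_matches("%Pr%d%t", "Senior VicePresident"))   # True
print(field_matches("Test?", "Test"))                     # False
```

The difference between the two leading wildcards shows up on a value such as "Unmanaged": in this sketch `*manage` does not match it, while `%manage` does.
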
  2. Keyword Searching

    • % Multi Character Wildcard: The multiple character wildcard will allow the user to search for partial word terms without knowing the exact spelling of the word or words. Just place a percent where the letters are missing.
      • Examples:
        • Keyword Search: “Pr%d%t Manager”
        • Possible Result: Prdt Manager
        • Possible Result: Product Manager
        • Keyword Search: “product% Manager”
        • Possible Result: Production Manager
        • Possible Result: Product Manager
        • Keyword Search: “%r%d%t% m%g%r”
        • Possible Result: Production Manager
        • Possible Result: Product Manager
        • NOTE: Overuse of this operator will cause results to become too ambiguous and cause slower returning results or timeouts depending on the overall search run.
    • ? Single Character Wildcard: The question mark as a search operator will allow for a user to find a word or words without knowing the exact spelling. Unlike the percent wildcard the question mark operator requires some sort of character to be present in the location of the search operator in the search term.
      • Examples:
        • Keyword Search: Te?t
        • Possible Result: Test
        • Possible Result: Text
        • Keyword Search: Test?
        • Possible Result: Tests
        • Not a Result: Test
        • NOTE: ‘Test’ is not a result because Lucene expects there to be a character in this location.
    • ~ Fuzzy Search Operator: The Fuzzy Search operator is available to find similar words to the keyword which is typed. This can only be placed at the END of a word. This also can only be added next to a SINGLE keyword.
      • Examples:
        • Keyword Search: nest~
        • Possible Result: test
        • Possible Result: best
    • “keyword1 keyword2”~Distance Proximity Searching: Using the distance proximity search syntax, two words can be placed in quotes followed by a tilde and a number which represents the maximum desired distance between the words. This number must be a positive whole number. The order of the words in quotes does not matter, as seen in the examples below.
      • Examples:
        • Keyword Search: “Sales Manager”~3
        • Possible Result: sales consultant working towards manager position
        • Possible Result: manager of sales
    • ^ Term Boosting: A term can be given a boosted value in the scoring equation by placing the caret and a number after the search term. This number must be a positive whole number greater than zero.
      • Examples:
        • Keyword Search: Product^10 Manager Sales
        • Possible Result: Product Manager
        • Note: Due to Boosting, the records with ‘Product Manager’ will come back with a higher relative score than ‘Sales Manager’.
    • OR, AND, NOT Boolean Operators: Currently OR, AND and NOT are the Boolean operators supported in PCRecruiter’s Lucene keyword search implementation. These operators can be placed between words and phrases to create complex or simple searches. If no operator is placed between multiple keywords which are not in quotes (creating a phrase) the default operator OR will be used.
    • Note: These operators must be in ALL CAPS
      • Example AND:
        • Keyword Search: Sales AND Manager will return results for records that have both of those keywords located in any one section of PCR (Notes, Keywords, Resumes, etc.)
        • NOTE: Both of these words have to appear in the same section to return results. If Sales is only located in the record’s resume and Manager is only located in the record’s NOTES, the name WILL NOT come back.
      • Example OR:
        • Keyword Search: Sales OR Manager (same as typing: Sales Manager) will return results for records that have the word Sales or the word Manager anywhere in the keyword indexed areas such as Notes, Keywords, Resume, etc.
      • Example NOT:
        • Keyword Search: Sales Product NOT Analyst will return results where a record has Sales OR Product, but WILL NOT return records that also have the word Analyst.
        • NOTE: The NOT operator cannot be used by itself. For example, the Keyword Search NOT “business analyst” will return no valid results when used by itself. A message will be displayed noting “Error: Cannot perform a NOT boolean search without any other criteria.”
    • (keyword) Boolean Grouping: Parentheses can be placed around groups of words or phrases to better relate groups of search terms to each other. This will allow users to create possible subsets of required search terms as seen in the example below.
      • Example:
        • (“sales manager” OR “product manager”) AND analyst
        • This search brings back any record that has either the phrase “sales manager” or “product manager” and also has the word Analyst in a section of keywords (such as Notes, Resumes, etc.)
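
The distance proximity syntax described earlier in this section (“Sales Manager”~3) can be illustrated with a simplified check. This sketch is an assumption-based approximation: the function name is hypothetical, "distance" is taken to mean the number of words between the two terms, and word order is ignored. Lucene's own distance calculation may differ in detail, for example in how it treats reversed word order or unindexed stop words.

```python
import re


def within_distance(text: str, word_a: str, word_b: str, max_between: int) -> bool:
    """Return True if the two words occur with at most max_between words between them."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(
        abs(i - j) - 1 <= max_between
        for i in pos_a
        for j in pos_b
        if i != j
    )


print(within_distance("sales consultant working towards manager position",
                      "Sales", "Manager", 3))                      # True
print(within_distance("manager of sales", "Sales", "Manager", 3))  # True
```
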
  3. Additional Keyword Searching Options

    • Keywords: When creating or modifying a search in the simple or Advanced Search keyword search boxes, a user can now use the syntax Keywords: within the search area. When used, the results will be limited to only records having the search terms following the Keywords: delimiter in the actual Keywords section of the record.
      • Examples:
        • “Sales Manager” OR “Production Manager” KEYWORDS: MRK1 OR MRK2
        • Possible Results will include the phrases “Sales Manager” OR “Production Manager” in any keyword indexed section (Resume, Notes, Keywords, Summary, Attachments, and Profiles) in PCRecruiter, but only those with “MRK1” or “MRK2” specifically in the Keywords area of the record.
    • Advanced Keyword Search Options: When building advanced searches in the Advanced Search screen, the keyword indexed areas of PCRecruiter will now appear under the Predefined Field List along with the other Predefined fields such as First Name, Company Name, or Job Title.
      • Examples:
        • Predefined Field — First Name — Like — Doug
        • AND
        • Predefined Fields — RESUME — “Sales Manager” OR “Production Manager”
        • AND
        • Predefined Fields — ATTACHMENTS — “Sales Manager” OR “Production Manager”
        • AND
        • Predefined Fields — KEYWORDS — MRK1 OR MRK2
        • Possible Results would include anyone with the first name of Doug (or Douglas, etc.) with a Resume AND Attachment containing “Sales manager” OR “Production Manager” AND also containing the keywords MRK1 or MRK2 in the Keywords area of the record.
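
The Keywords: delimiter described above effectively splits one search string into two parts: a general part searched across all keyword indexed sections, and a part restricted to the Keywords area. A minimal sketch of that split follows; the function name is hypothetical, and treating the delimiter as case-insensitive is an assumption.

```python
import re
from typing import Optional, Tuple


def split_keywords_clause(query: str) -> Tuple[str, Optional[str]]:
    """Split a search string at the first 'Keywords:' delimiter, if present."""
    parts = re.split(r"\bkeywords:", query, maxsplit=1, flags=re.IGNORECASE)
    general = parts[0].strip()
    restricted = parts[1].strip() if len(parts) == 2 else None
    return general, restricted


general, restricted = split_keywords_clause(
    '"Sales Manager" OR "Production Manager" KEYWORDS: MRK1 OR MRK2'
)
print(general)     # "Sales Manager" OR "Production Manager"
print(restricted)  # MRK1 OR MRK2
```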

Search Results

  1. Sorting: In a Lucene enabled database, Lucene search results can be sorted by most field names on the search results screens.
  2. Matching Item Counts: Keyword matching item counts are included when hovering over the appropriate section in keyword search results. Lucene will return counts for fuzzy searches as well as multiple uses of the wildcard.
  3. Redder is Better: The level of redness is based on the “score” which Lucene assigns to search results, with the highest scores being brought to the top by default.
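
The "Redder is Better" idea can be illustrated by mapping a score to a color. How PCRecruiter actually derives the shade is not documented here, so everything in this sketch is an assumption: scores are scaled against the highest score in the result set, and the color fades linearly from white (no match strength) to pure red (best match).

```python
def score_to_hex(score: float, max_score: float) -> str:
    """Map a score to a hex color between white (#ffffff) and red (#ff0000)."""
    if max_score <= 0:
        return "#ffffff"
    ratio = min(max(score / max_score, 0.0), 1.0)
    other = round(255 * (1.0 - ratio))   # green and blue channels shrink as score rises
    return f"#ff{other:02x}{other:02x}"


print(score_to_hex(10.0, 10.0))  # #ff0000
print(score_to_hex(0.0, 10.0))   # #ffffff
```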

Indexing Rules

  1. We use the Standard Tokenizer in PCR’s version of Lucene. What does this mean?
    • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
    • If a term consists of single letters, each followed by a period (for example, an acronym such as a.b.c.), the periods will not be indexed.
    • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes internet hostnames as one token.
  2. Indexing Explanations/Examples:
    • Special Rules:
      • All terms are split on the @ sign, meaning that an email address will index as two words in that field. This allows the domain section of the email address to be found without needing partial word wildcards (which is far more efficient for searching).
        • Example:
          • Raw Data: test@mainsequence.net
          • Indexed Data: test mainsequence.net
    • Character Rules:
      • The following general rules are followed for indexing of general characters.
        • Letters: A through Z
        • Numbers: 0 through 9
        • Periods which are not followed by whitespace.
        • Dashes when considered part of a serial number.
        • Accents are not indexed:
          • Raw Data: Über
          • Indexed Data: Uber
    • The following “Noise” or “Stop” words are not indexed:
      • an
      • and
      • are
      • as
      • at
      • be
      • but
      • by
      • for
      • if
      • into
      • is
      • it (lowercase)
      • not
      • of
      • such
      • that
      • the
      • their
      • then
      • there
      • these
      • they
      • this
      • to
      • was
      • with
    • What does it mean when words are not indexed?
      • Raw Data: business and decision
      • Indexed Data: business decision
        • Search Examples to find this data in Lucene (Field Search)
          • Predefined Fields—Title – LIKE — *Business and Decision
          • Keywords: (None Entered)
          • Predefined Fields—Title – LIKE — Business and Decision
          • Keywords: Sales
        • Search Example to find this data in Keywords
          • Keywords: “Business and Decision”
          • Keywords: “Business Decision”
    • What do the tokenizer rules mean?
      • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
        • Examples:
          • Raw Data: Sales Operations Manager at Main Sequence Technologies, Inc. (www.pcrecruiter.net)
          • Indexed Data: Sales Operations Manager at Main Sequence Technologies Inc www.pcrecruiter.net
      • If a term consists of single letters, each followed by a period (for example, an acronym such as a.b.c.), the periods will not be indexed.
        • Examples:
          • Raw Data: a.b.c.
          • Indexed Data: abc
          • Raw Data: aa.bb.cc.
          • Indexed Data: aa.bb.cc
      • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
        • Examples:
          • Raw Data: ADMIN-SMITH
          • Indexed Data: ADMIN SMITH
          • Raw Data: ADMIN1-SMITH
          • Indexed Data: ADMIN1-SMITH
      • Recognizes internet hostnames as one token.
        • Examples:
          • Raw Data: Main Sequence Technologies, Inc. (www.pcrecruiter.net)
          • Indexed Data: Main Sequence Technologies Inc www.pcrecruiter.net
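
The tokenizer rules and the @ rule above can be approximated in a few lines. This is a rough sketch for illustration, not Lucene's Standard Tokenizer: the function name is hypothetical, only the rules listed in this section are modeled (surrounding punctuation, dots, single-letter acronyms, hyphens, and the @ split), and stop word removal and accent handling are left out.

```python
import re
from typing import List

# Single letters each followed by a period, e.g. "a.b.c."
ACRONYM = re.compile(r"(?:[A-Za-z]\.)+")


def tokenize(text: str) -> List[str]:
    """Approximate the indexing rules described above."""
    tokens: List[str] = []
    for chunk in text.split():
        for piece in chunk.split("@"):                        # split email addresses at @
            piece = re.sub(r"^\W+|[^\w.]+$", "", piece)       # trim punctuation, keep dots for now
            if ACRONYM.fullmatch(piece):
                piece = piece.replace(".", "")                # a.b.c. -> abc
            else:
                piece = re.sub(r"\W+$", "", piece)            # drop a dot followed by whitespace
            if not piece:
                continue
            if "-" in piece and not any(c.isdigit() for c in piece):
                tokens.extend(p for p in piece.split("-") if p)   # split at hyphens
            else:
                tokens.append(piece)                          # digits present: keep as one token
    return tokens


print(tokenize("Main Sequence Technologies, Inc. (www.pcrecruiter.net)"))
# ['Main', 'Sequence', 'Technologies', 'Inc', 'www.pcrecruiter.net']
print(tokenize("ADMIN-SMITH ADMIN1-SMITH"))
# ['ADMIN', 'SMITH', 'ADMIN1-SMITH']
```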