Search Terms – Advice from the Experts
As data volumes grow, so does the use of technology to make sense of them, especially in disputes and investigations, where the key data is buried within thousands or millions of documents, emails, text messages, chats, images and databases, among a multitude of other constantly evolving data sources.
A key initial step in identifying relevant data is applying search terms. It sounds simple: understand the case, get input from the client, quickly draft search terms and let the tech team handle the rest. However, balancing the risk of being too narrow (and potentially missing crucial data) against being too wide (and pushing a flood of irrelevant data into review) is an art, and the results directly affect the overall cost of discovery: more data means more cost.
In this article, our team offers insights into how to get started and some under-utilised tech features that can help streamline this process, reducing the associated risks, timelines, and costs.
1. Start Broadly
Take a broad approach when compiling the initial search terms; think of it as building the foundation for more targeted searches. Start with a thorough understanding of the case and extensive input from the client, and align the initial terms with the specifics of the case (e.g., key names, parties and case concepts).
Often, this foundation will return a large volume of results, making it difficult to narrow the search without inadvertently excluding critical information. This is where advanced analytics tools can help refine your searches.
2. Use Analytics to Learn More
Some effective tools you can utilise include:
- Clustering: Documents are grouped into 'clusters' based on the concepts within the documents. Each cluster is titled with the most common words within the set – helping you identify shared themes, patterns within your data and new potential search criteria not previously included in your queries.
- Word Lists: A report of all the words used across the dataset allows you to look at the frequency of keywords and their variations and choose some to supplement or modify your search criteria.
- Metadata Filters: Filtering by domains, dates and other metadata can significantly reduce irrelevant results. Various reports detailing specific types of metadata can be provided to help.
- 'More Like This' Searches: Using sample documents to find similar ones can refine your search and improve accuracy. Has the client sent you something extremely relevant? If so, use it!
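As a rough illustration of the Word Lists idea above, the following Python sketch counts word frequencies across a handful of texts and pulls out the variations of one keyword stem. The sample texts, the `build_word_list` helper and the simple tokenisation rule are all assumptions made for the example; a real review platform will build its word list from its own index.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for extracted document content.
documents = [
    "The contract for Project Falcon was signed by both parties.",
    "Falcon contracts were amended; see the signed amendment.",
    "Lunch menu for Friday: soup and sandwiches.",
]

def build_word_list(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Simplified tokenisation (an assumption): runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

word_list = build_word_list(documents)

# Variations of a keyword stem show up side by side, which can suggest
# terms to add to (or wildcard in) the search criteria.
contract_variants = {
    word: count for word, count in word_list.items() if word.startswith("contract")
}
print(contract_variants)
```

Scanning the list for unexpected variations or frequently used project names is one way to find candidate terms that were not in the original query.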
Finally, try thinking about the opposite. See if you can identify documents that can be easily excluded – things like unrelated project names; plainly irrelevant, non-English content; certain custodians; marketing and spam emails; or data outside the timeframe of the matter.
Using these simple methods, in a few clicks we have now:
- Excluded data that is clearly irrelevant – lowering ongoing fees.
- Found some new terms we hadn’t considered before – lowering risk.
Early case assessment (ECA) tools enable lawyers to better understand their data before starting linear document review, providing more certainty, higher-quality data in review and significantly lower costs.
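To make the exclusion idea more concrete, here is a minimal Python sketch of filtering by metadata. The record fields (`sender_domain`, `sent`, `language`), the excluded domain and the date range are hypothetical values chosen for illustration; they do not reflect any particular tool's data model.

```python
from datetime import date

# Hypothetical document records; field names are illustrative only.
documents = [
    {"id": 1, "sender_domain": "client.com", "sent": date(2021, 3, 5), "language": "en"},
    {"id": 2, "sender_domain": "newsletter.example", "sent": date(2021, 4, 1), "language": "en"},
    {"id": 3, "sender_domain": "client.com", "sent": date(2018, 1, 9), "language": "en"},
    {"id": 4, "sender_domain": "supplier.com", "sent": date(2021, 6, 2), "language": "fr"},
]

# Assumed exclusion criteria for this example matter.
EXCLUDED_DOMAINS = {"newsletter.example"}             # marketing / spam senders
DATE_RANGE = (date(2020, 1, 1), date(2022, 12, 31))   # timeframe of the matter
INCLUDED_LANGUAGES = {"en"}                           # only if other languages are plainly irrelevant

def is_in_scope(doc):
    """Keep a document only if it passes every metadata filter."""
    in_date_range = DATE_RANGE[0] <= doc["sent"] <= DATE_RANGE[1]
    return (
        doc["sender_domain"] not in EXCLUDED_DOMAINS
        and in_date_range
        and doc["language"] in INCLUDED_LANGUAGES
    )

in_scope_ids = [doc["id"] for doc in documents if is_in_scope(doc)]
print(in_scope_ids)
```

Keeping each criterion as a named constant makes it easier to report exactly what was excluded and why.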
3. Understand Search Logic
Using search logic and syntax correctly is crucial, yet it is easy to introduce ambiguity or errors. Work closely with the tech team assisting you, and clearly explain your objectives to ensure accurate search queries are built and run.
Some useful key terms:
- AND/OR Operators:
- AND: includes documents that contain all specified terms.
- OR: includes documents that contain any of the specified terms.
- Proper Use: when combining multiple AND / OR operators, use brackets to group terms and clarify what you’re looking for.
For example, you’re looking for data with the terms ‘cat’ and ‘dog’ or ‘tail’ and ‘fur’:
- cat AND dog OR tail AND fur – it is unclear what the relationship between the four terms should be.
- Should any one of the terms be present?
- cat OR dog OR tail OR fur
- Should ‘cat’ and ‘fur’ be present, but only if ‘dog’ or ‘tail’ is also present?
- (cat AND fur) AND (dog OR tail)
- Should all four terms be present?
- (cat AND dog) AND (tail AND fur)
- Should either ‘cat’ and ‘dog’ both be present, or ‘tail’ and ‘fur’ both be present?
- (cat AND dog) OR (tail AND fur)
- Wildcards: help to avoid duplicating search terms by including all versions.
For example, the keyword ‘dog*’
This wildcard search would capture ‘dog’, ‘dogs’, ‘doggo’, ‘doggy’ and ‘doghouse’, but it would also capture unrelated concepts like ‘dogmatic’ or ‘dogwood’.
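The effect of bracketing and wildcards can be illustrated with a small Python sketch. It is not how any review platform evaluates queries; the documents are hypothetical, each query is hand-translated into a Python condition, and ‘dog*’ is modelled (as an assumption) as "any whole word starting with ‘dog’".

```python
import re

# Hypothetical documents, each reduced to the set of words it contains.
documents = {
    "doc1": {"cat", "dog"},
    "doc2": {"tail", "fur"},
    "doc3": {"cat", "fur", "tail"},
    "doc4": {"cat", "dog", "tail", "fur"},
    "doc5": {"fur"},
    "doc6": {"dogmatic"},
}

# Each bracketed query from the text, written as an explicit Python condition.
groupings = {
    "cat OR dog OR tail OR fur":
        lambda w: "cat" in w or "dog" in w or "tail" in w or "fur" in w,
    "(cat AND fur) AND (dog OR tail)":
        lambda w: ("cat" in w and "fur" in w) and ("dog" in w or "tail" in w),
    "(cat AND dog) AND (tail AND fur)":
        lambda w: ("cat" in w and "dog" in w) and ("tail" in w and "fur" in w),
    "(cat AND dog) OR (tail AND fur)":
        lambda w: ("cat" in w and "dog" in w) or ("tail" in w and "fur" in w),
}

def run(match):
    """Return the sorted names of documents whose words satisfy `match`."""
    return sorted(name for name, words in documents.items() if match(words))

grouping_hits = {query: run(match) for query, match in groupings.items()}

# Assumption: 'dog*' is modelled as "any whole word that starts with 'dog'".
DOG_WILDCARD = re.compile(r"dog\w*")
wildcard_hits = run(lambda words: any(DOG_WILDCARD.fullmatch(w) for w in words))

for query, names in grouping_hits.items():
    print(f"{query}: {names}")
print(f"dog*: {wildcard_hits}")
```

The same four terms return four different document sets depending on the brackets, and the wildcard picks up the document containing only ‘dogmatic’, which is the kind of false positive worth checking for.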
To get a little more advanced, when working with more complicated searches, try to break down or simplify the terms to avoid the risk of errors:
- Don’t Be Afraid to Use Brackets: even if it is just to help visualise the search logic and the relationship between the phrases:
- Instead of: rain* w/2 cat OR dog AND tail OR fur
- Use: (rain* w/2 (cat OR dog)) AND (tail OR fur)
- Use New Lines: Searching for all terms in one query doesn’t provide insight into which terms are driving the results. Using a new line will show which components are causing spikes in results or a lack thereof:
- Instead of (cat AND dog) OR (tail AND fur) – returns 950 docs
- Use:
- (cat AND dog) – returns 200 docs
- (tail AND fur) – returns 800 docs
- Additionally, we can see that 50 documents hit on both lines, which is why the combined total is 950 rather than 1,000.
By breaking this longer query into multiple smaller ones, we can see and report which of the phrases are actually hitting on documents. Here, ‘tail AND fur’ accounts for 800 of the 950 results (roughly 84%).
- You Don’t Need Quotes: Adding quotation marks to terms (unlike Google or Outlook searching) is generally redundant. As an example, the below would yield identical results:
- Fluffy Dog
- “Fluffy Dog”
- Quotation marks are only needed when an operator term (AND / OR / NOT / ‘w/’, etc.) is part of a specific search phrase (e.g., “Oil and Gas” or “Health and Aged Care”). So, don’t worry about adding quotes unless the phrase you are searching for contains one of these operators.
- For example:
- “Health and Aged Care” – only data with this exact four-word phrase would be found.
- If you want data with both keywords / phrases, regardless of location, you can use either of the following:
- Health and Aged Care
- “Health” and “Aged Care”
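The ‘Use New Lines’ point above comes down to simple set arithmetic, sketched below in Python. The hit sets are hypothetical document IDs and `hit_report` is an illustrative helper, not a feature of any specific tool. The sketch shows why the combined count can be lower than the sum of the per-line counts: documents that hit on more than one line are only counted once.

```python
# Hypothetical hit sets: the document IDs returned by each line of the search.
hits_by_line = {
    "(cat AND dog)": {1, 2, 3, 4},
    "(tail AND fur)": {3, 4, 5, 6, 7, 8, 9, 10},
}

def hit_report(hits):
    """Summarise per-line hits, the de-duplicated total, and the overlap."""
    combined = set().union(*hits.values())      # documents hitting on any line
    shared = set.intersection(*hits.values())   # documents hitting on every line
    return {
        "per_line": {line: len(ids) for line, ids in hits.items()},
        "combined": len(combined),
        "on_every_line": len(shared),
    }

report = hit_report(hits_by_line)
print(report)
```

Reporting the per-line counts alongside the de-duplicated total makes it clear which line is driving the volume and how much the lines overlap.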
4. Verify the Results Before Review
All of the above can be done within an early case assessment tool before data is moved to review and hosted at a higher cost. So, verifying these results to ensure you aren’t over- / under-inclusive is a key step, as it directly impacts the overall time and cost in review.
The easiest method is to sample your results, either within the tool itself or by exporting a set for offline review. Reviewing a sample can provide insights into the effectiveness of your search terms, highlight areas for improvement and weed out false positives.
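As a rough sketch of the sampling step, the Python below draws a reproducible random sample from a set of search hits and computes the share of sampled documents marked relevant. The document IDs, the sample size, the fixed seed and the stand-in "reviewer decisions" are all assumptions for illustration; in practice the relevance calls come from a human reviewer, and the appropriate sample size depends on the matter.

```python
import random

def draw_sample(doc_ids, sample_size, seed=42):
    """Draw a reproducible simple random sample of document IDs."""
    rng = random.Random(seed)   # fixed seed so the same sample can be re-drawn
    ids = sorted(doc_ids)       # stable ordering before sampling
    return rng.sample(ids, min(sample_size, len(ids)))

def estimate_precision(sample, relevant_ids):
    """Share of sampled documents that the reviewer marked as relevant."""
    if not sample:
        return 0.0
    return sum(1 for doc_id in sample if doc_id in relevant_ids) / len(sample)

# Hypothetical search results: 1,000 document IDs.
search_hits = range(1, 1001)
sample = draw_sample(search_hits, 50)

# Stand-in for reviewer decisions (purely illustrative): even IDs are "relevant".
relevant = {doc_id for doc_id in sample if doc_id % 2 == 0}
precision = estimate_precision(sample, relevant)
print(f"Sampled {len(sample)} documents; estimated precision {precision:.0%}")
```

A low share of relevant documents in the sample suggests the terms are too wide; sampling the documents that did not hit on any term is a similar check for terms that are too narrow.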
TransPerfect’s proprietary early case assessment tool, Digital Reef, was designed to make the above simple, user-friendly and fast – whether it’s a thousand documents or millions. Digital Reef has enabled clients to understand datasets at scale, get critical data to their teams more easily and efficiently, and significantly reduce their overall costs by pushing, on average, 50% less data into review than the industry baseline.
Reach out today to find out more about how the Reef Suite of products can help your team.