Call now: (800) 766-1884  



 Home


 SQL Server Tips
 SQL Server Training

 SQL Server Consulting
 SQL Server Support
 SQL Server Remote DBA



 Articles
 Services
 SQL Server Scripts
 Scripts Menu



 

 

 

 
 

SQL Server

SQL Server Tips by Gama and Naughter Consulting

 


Applications for XP_REGEXP

XP_REGEXP brings all the power and flexibility of regular expressions to TSQL and in particular to SQL queries. This will be particularly useful for validating input, performing better searches, checking text (finding duplicates, punctuation errors, etc.).

Ecommerce search based on XP_REGEXP

One of the most powerful applications of XP_REGEXP would be in an Ecommerce web site, to perform elaborate searches. In Ecommerce it is common to have an ASP web application with pages that change dynamically. The data for these pages is read from a database when the client requests the page. Before rendering the HTML to be sent to the user, the web server will retrieve the data from the SQL Server and place it in a table, grid or any other object.

If the page performs a data search, this process must be optimized so that it will return the maximum number of matches, including partial matches, in the least amount of time. Avoiding illogical matches is also important and that excludes very loose Soundex matches. Matching more than one keyword is another factor against Soundex or similar algorithms because it would require code that would be complicated and slow.
Ecommerce databases might work with huge amounts of data or with numerous simultaneous users. Either way, they are very demanding and require the fastest tools and techniques because a slow response time is enough to lose customers and harm business.
This example uses data from a real world application to show that XP_REGEXP is production quality and that the benefits of using XPís must be considered when designing a high performance project.

The code that accompanies the book comes with an Ecommerce database with nearly 10,000 records; the table has been simplified to contain only one field of text. The table is named TableTools and has one field named Short Description, which is a description of tools and products. It is necessary to import this table because the following tests are based on it. In Microsoft SQL Server Enterprise Manager, right clicking on the Tables node will display a menu with several options, one of which is All Tasks, this option has two sub options, one for importing and the other one for exporting data. From here it is necessary to choose text data as a source, browse to the file TableTools.csv and choose the option of having the columnsí names in the first row.

Now that the test data has been loaded, the first test will be to compare XP_REGEXP with Cory Koskiís UDF. The UDF took 15 seconds while XP_REGEXP took just three. This is the sample code used:

select * from TableTools where dbo.find_regular_expression([Short Description],'heavy.*drill',1)=1

select * from TableTools where dbo.findRegex([Short Description],'heavy.*drill',0)=1

Will the consumers who are looking for a tool have to know regular expressions to do a search? Yes, and they will also have to calculate the totals of their invoice manually and using Reverse Polish Notation! While this might sound OK for the developer it would ruin the business operation.
The best solution is to parse the user input and transform it into a regular expression that will maximize the results from the search.

The process of creating a regular expression automatically must follow rules derived from the particular data involved. In this example there are several scenarios considered as a cause of missing hits from the search:

* Not using spaces between numeric and alphanumeric keywords.
* To use extraneous characters between keywords, including unnecessary keywords.
* To use either a name or its representation, for example inch or ".
* To use synonyms or short forms, for example pc or piece.
* To use singular or plural, for example nail or nails.
* To use decimal numbers instead of fractions.

There are other scenarios but for this example, these will be a good start for extracting as many rows per search as possible.
The parsing is done with a sp (SPparseInputForRegex) that will use string and character manipulation in order to create a regular expression that will solve the six problems mentioned above. The sp has a loop that will check one character at a time and replace characters other than alphanumeric ones' and the inch symbol, the divide sign and the decimal point. The function REPLACE is used next to make the input closer to the table data.

See Code Depot

The code replaces spaces with .* to force all the garbage between two keywords to be rejected.

Testing the code with difficult searches:

declare @regex varchar(8000)

EXEC SPparseInputForRegex '1"nails', @regex OUT
select [Short Description] from TableTools where dbo.findRegex([Short Description],@regex,0)=1

EXEC SPparseInputForRegex '1" nails', @regex OUT
select [Short Description] from TableTools where dbo.findRegex([Short Description],@regex,0)=1

EXEC SPparseInputForRegex '1inch nails', @regex OUT
select [Short Description] from TableTools where dbo.findRegex([Short Description],@regex,0)=1

EXEC SPparseInputForRegex '1nail', @regex OUT
select [Short Description] from TableTools where dbo.findRegex([Short Description],@regex,0)=1

The first three snippets return 113 rows and the last one will return 153 rows. The last one is more generic and will allow more matches because it has two keywords while the other ones had three. Each search takes only a few seconds and will be unnoticeable in a web application.

Note: this regular expression framework, unlike other implementation, considers abbreviation as escape codes when they are within a class.

Example: Validating integer numbers

PRINT dbo.findRegex('123a','^\d+$',0)
PRINT dbo.findRegex('123a','^[\d]+$',0)
PRINT dbo.findRegex('d','^[\d]+$',0)

Output:
1
0
1

The first snippet works fine, it looks for any number in the string and the second one would work on other regular expression engines but not here. The [\d] construct looks like it is defining a class with the abbreviation in it but in reality, it is defining a class with the escaped "d" character. The last example proves it.

Another peculiarity is that {} indicates a match group and not a match for a number of repeated characters. For example: Suppose you wanted to validate a credit card input, assuming that sixteen digits in groups of four, divided by dashes are a valid match. The common solution is to use an expression like ^(\d{4}-){3}\d{4}$ but the same result can be achieved by defining the number of digits explicitly:

PRINT dbo.findRegex('1234-1234-1234-1234','^\d\d\d\d-\d\d\d\d-\d\d\d\d-\d\d\d\d$',0)


The above book excerpt is from:

Super SQL Server Systems
Turbocharge Database Performance with C++ External Procedures

ISBN: 0-9761573-2-2
Joseph Gama, P. J. Naughter

 http://www.rampant-books.com/book_2005_2_sql_server_external_procedures.htm

 


 

 

 

Burleson Consulting Remote DB Administration


 

 


 

 

 

 

 
Burleson is the America's Team

Note: The pages on this site were created as a support and training reference for use by our staff of DBA consultants.  If you find it confusing, please exit this page.

Errata?  SQL Server technology is changing and we strive to update our SQL Server support information.  If you find an error or have a suggestion for improving our content, we would appreciate your feedback.  Just  e-mail:and include the URL for the page.
 


Burleson Consulting
SQL Server database support

 

Copyright © 1996 -  2013 by Vaaltech Web Services. All rights reserved.

Hit Counter