Troubleshooting SQL index performance on varchar columns

During a deployment last night, I ran into an indexing performance issue with SQL columns of type varchar. Varchar is the ANSI version of character data, stored as one byte per character, while nvarchar is Unicode, stored as two bytes per character. For this database, we typically opt for varchar: it supports a US chain of retail stores and has tables with millions (sometimes hundreds of millions) of rows. Storage on a database like this is important, so we mind our strings.
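The storage difference is easy to see with DATALENGTH (the variable names here are just for illustration):

```sql
-- varchar stores one byte per character, nvarchar stores two
DECLARE @ansi    varchar(20)  = '9876543210987654';
DECLARE @unicode nvarchar(20) = N'9876543210987654';

SELECT DATALENGTH(@ansi)    AS AnsiBytes,    -- 16
       DATALENGTH(@unicode) AS UnicodeBytes; -- 32
```

For an indexed column, that doubling applies to every key stored in the index as well.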

I ran into a case recently where I was running a very simple query to find transactions by member. The table contained all in-store transactions for a nationwide retail chain. Not too bad of a query, and while the table has a lot of rows, 42 million rows for 18 months of production data is nothing crazy. Something SQL Server should be able to handle without any problems whatsoever. So I ran a simple query:

SELECT * FROM [Transactions] WHERE [AccountId]=@1

This is more or less what came straight out of my ORM. However, the query was SLOOOOOOOW. As in minutes slow. First check, let’s look at the table to make sure the index is defined appropriately. We had this index:

CREATE NONCLUSTERED INDEX [IX_TransactionsAccountId] ON [dbo].[Transactions]
(
   [AccountId] ASC
)

So far so good: we have an index defined, but why is my query so slow? I checked all the statistics and whatnot and found that the index seemed fine. I dropped down to SQL Server Management Studio to try the query out.

To my surprise, it was blazing fast. What was the difference? I used SQL Profiler to see what the actual query being sent down was. After a bit of an epiphany there, I ran these two queries:

SELECT *
FROM Transactions
WHERE AccountId = '9876543210987654'
GO
SELECT *
FROM Transactions
WHERE AccountId = N'9876543210987654'

And wouldn’t you know it, the first query was fast and the second slow. But why was the second one so slow? And why did this matter? Looking at the execution plans in SSMS:

[Execution plans: the first query produces an index seek, the second an index scan]

So two seemingly identical queries, but the execution plans are very different! The difference is that the first query uses an ANSI string, while the second uses a Unicode string. The first results in an index seek (very fast), while the second results in an index scan (horrible: every row is examined). For each row, SQL Server converts the indexed value to nvarchar, as seen in the scan's predicate in the plan details:

CONVERT_IMPLICIT(nvarchar(200),[dbo].[Transactions].[AccountId],0)=N'9876543210987654'

Of course it’s going to be slow! Every column value is converted to nvarchar and then compared to the parameter, instead of the other way around. This is because Unicode-to-ANSI conversion is lossy, so the parameter can’t be implicitly downconverted; instead, the ANSI column has to be upconverted to Unicode, row by row, which defeats the index.
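If you control the parameter’s declared type, the fix at the T-SQL level is simply to match the column. A minimal sketch, assuming the column is varchar(200) as the predicate above suggests:

```sql
-- Declare the parameter with the column's ANSI type: no per-row
-- conversion is needed, so the optimizer can use an index seek
DECLARE @AccountId varchar(200) = '9876543210987654';

SELECT *
FROM Transactions
WHERE AccountId = @AccountId;
```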

Anyway, going back to our ORM and forcing the SQL parameter type to be an ANSI string fixed the problem. And once we found this one, we found a dozen other places that I had screwed up.
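The before/after is visible in what the ORM sends over the wire. Roughly what Profiler shows (the parameter lengths here are illustrative, not what our ORM actually emitted):

```sql
-- Before: the ORM types the parameter as nvarchar --> index scan
EXEC sp_executesql N'SELECT * FROM [Transactions] WHERE [AccountId]=@1',
    N'@1 nvarchar(4000)', @1 = N'9876543210987654';

-- After: forcing an ANSI string types it as varchar --> index seek
EXEC sp_executesql N'SELECT * FROM [Transactions] WHERE [AccountId]=@1',
    N'@1 varchar(200)', @1 = '9876543210987654';
```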

Lessons learned

One way to fix this would be to use Unicode all the time, everywhere. For a lot of character data, that makes sense. But for data that only happens to be character data because it originates from automated systems, making it Unicode doesn’t make as much sense. For tables with large numbers of rows and indexes on character-based columns, space does matter. Speaking as someone who has had databases run out of space, this is a problem I need to care about.

The other lesson is that ORMs aren’t a tool to hide the database; they’re there to abstract some pieces and encapsulate others. It’s still imperative that I know what is going on between my system and the database, no matter what form that communication takes.

And finally: everything performs great on a local database with 10 rows. If I test against realistic data volumes any query that does more than a primary key lookup, I can eliminate this problem before going to production and having to do an emergency deploy because the website was spinning.

About Jimmy Bogard

I'm a technical architect with Headspring in Austin, TX. I focus on DDD, distributed systems, and any other acronym-centric design/architecture/methodology. I created AutoMapper and am a co-author of the ASP.NET MVC in Action books.
This entry was posted in Architecture.
  • http://twitter.com/PhatBoyG Chris Patterson

    Yes, using something like:

    Map(x => x.ControlNumber, "ControlNo").Length(20).CustomType("AnsiString");

    with NHibernate is mandatory if you’re using varchar instead of nvarchar, or your performance will absolutely suck.

  • cbp

    Wow, I know what I’ll be doing first thing tomorrow morning then! Thanks!

  • Larry Silverman

    Hi, Jimmy. Are you using Enterprise Edition? If so, you can enable data compression and use Unicode for simplicity and consistency everywhere.

  • Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1149

  • Andrei Rinea

    That’s why I hate ORMs. They solve n problems and create n+1.

    • Anonymous

      Ha! What’s the alternative?

  • Michael K. Campbell

    Jimmy, this is a textbook example of a problem that DBAs and perf gurus typically call data-type coercion: data-type mismatches (as you clearly outlined) invalidate the use of an index.

    And, as an FYI, you can EVEN run into this in situations where the column is, say, nvarchar(20), and @p1 is defined as nvarchar(10) – because you’ll run into the same problem with coercion. 

  • James Banner

    Awesome post, I didn’t know that; learned something new.

    Just goes to show that the software engineer’s dream of “hiding” complexity, of “persistence ignorance” as far as the database goes, is just a pipe dream we can live in until a problem comes knocking.

  • http://twitter.com/ssboisen Simon Skov Boisen

    Had a similar experience the other day where I was accidentally comparing a string with an integer (I simply forgot to put ticks around the number when I wrote the query), which meant that for every single row MSSQL was converting the string to an integer. I looked at the query plans and everything looked as it should; that really made me scratch my head until one of the database gurus at my company helped me find the error, simple as it was.

  • Erik

    Excellent! Just sent it over to our developers :)

  • Marc Scheuner

    Interestingly, with Entity Framework (an ORM), you *can* define whether or not to use Unicode strings – with things like data annotations on your model classes, or with .AsNonUnicode(). On the other hand, with *RAW* ADO.NET (which certainly doesn’t “hide” the database in any way, shape or form), you can’t seem to define this – it will always use N’…..’ for your search parameters – even if the underlying column is a “varchar” – thus killing your query performance …..

  • Dave

    Hi,

    Could this be due to your collation?

    If it’s Latin1_General_CI_AS, then joining between varchar and nvarchar still uses an index seek. However, if you’re using the SQL_ collations, this can cause the index scan.

    Here’s a link to Jonathan Kehayias’ blog where he’s detailed this.
    http://www.sqlskills.com/blogs/jonathan/implicit-conversions-that-cause-index-scans/

    I agree that link fields should ideally not be Unicode, as this could double your index size.