Wednesday, September 05, 2007 12:00 AM bart

Visual Basic 9.0 Feature Focus - LINQ Queries

Welcome back to the Visual Basic 9.0 Feature Focus blog series. In this post, we'll (finally) cover one of .NET Framework 3.5's core features for both C# 3.0 and VB 9.0: LINQ. LINQ stands for Language INtegrated Query and provides querying syntax directly in a general-purpose programming language. Combined with clean and elegant APIs like LINQ to Objects, LINQ to SQL and LINQ to XML (amongst others, like LINQ to SharePoint) this gives developers all the horsepower they need to retrieve and manipulate data from any LINQ-capable data source, also closing the gap between various data models and objects from the world of OO (for instance, by having entity mappings for relational SQL databases, a.k.a. O/R mapping).

So, what's the big idea? In the past, there were various problems related to working with data. One of those problems is the programming language's inawareness of the concept of queries. For example, you typically write code like this:

Using conn As New SqlConnection(dsn)
    Using cmd As New SqlCommand("SELECT * FROM Products WHERE UnitPrice > 100 ORDERBY UnitsInStock DESCENDING", conn)
        Using reader = cmd.ExecuteReader()
            While reader.Read()
                Dim name As String = CType(reader("ProductName"), String)
                Dim price As Decimal = CType(reader("UnitPrice"), Decimal)
                'Use it
            End While
        End Using
    End Using
End Using

What's wrong with this? A couple of things in fact. First of all, the SQL statement is wrapped inside a string which makes it impossible for the compiler to analyze it and check it against the database schema for correctness (well, in theory we could have additional compiler intelligence, like the printf format string checking in many C/C++ compilers, but that doesn't solve the fundamental problem). Secondly, inside the iteration statement, we're faced with weak typing problems: ProductName was a string, is masked as an object and has to be casted back to a String in order to work with it (which doesn't seem too much of a problem if you're working with late binding semantics, but it would be better if we could have strong typing bubbling up to the surface). Another typical problem is query statement parameterization which can be done but (in lots of case) wrong as well. What I'm referring at is a phenomenon known as SQL Injection Attacks that occurs if you're building query statements by simple string concatenation, instead of using parameterized statements or sprocs. Last but not least there's all of this awful plumbing around the query statement in order to execute it and to iterate over the results, which by the way aren't real results: it's a bit of loosely coupled data that you still have to tie together using some O/R mapping to live in an ideal world. So, how can this be improved?

The approach taken by LINQ is unique in many fashions, the core asset being the LIN part: Language INtegrated. Furthermore, with LINQ we do have one single syntax for all sorts of queries, including in-memory ones and queries targeting various other kinds of data sources, like SQL, XML, web services like Amazon, SharePoint, ... In this post, we'll take a closer look at the query syntax and what it corresponds to in terms of extension method calls. Be prepared to get a big deja-vu of most features we did discuss in previous posts, including Extension Methods, Lambda Expressions, Expression Trees, Anonymous Types, Object Initializers, etc. Let's start with a very simple query using LINQ to Objects, otherwise known as in-memory queries:

Dim ints As Integer() = {1, 2, 3, 4, 5}
Dim res = From i In ints Where i Mod 2 = 0 Select i
For Each i In res
    Console.WriteLine(i)
Next

Note: A common question I often get is why the query starts with a From clause, in contrast to SQL statements. The reason why isn't difficult at all: by having the From clause first, the IDE knows about the source you're about to query. Since the source should be an IEnumerable(Of T), each time you refer to an object from that source using the dummy name from the From clause (the .. part in From .. In ...), it knows that one is of type T (from IEnumerable(Of T)). So, the editor can provide IntelliSense on the query statements. If you were to start with a Select clause, how can the editor know what the members are of i (Select i. --> can you suggest any meaningful IntelliSense, ignoring the System.Object members of course?).

The query will return all even numbers, but how does it really work? Let's do what the compiler does: translate the query to another format:

Dim res = ints.Where(Function(i) i Mod 2 = 0).Select(Function(i) i)

Essentially this is a chain of extension method calls on the IEnumerable(Of T) object, combined with lambda expressions. In compiling the original query, the compiler looks for the dummy variable in the From .. In clause and uses that one as the parameter to all subsequent lambdas. Next it takes the expressions followed by Where .., Select .., etc and cooks a lambda out of it, by taking the dummy as the parameter and the original expression as the lambda body. Finally it chains the whole thing together using extension methods calls. Now on to these extensions: what does Where and Select do? First of all, both are extensions to IEnumerable(Of T) and have been defined in System.Linq.Enumerable. So, the code above is equal to:

Dim res = Enumerable.Select(Enumerable.Where(ints, Function(i) i Mod 2 = 0), Function(i) i)

Note: you might wonder why there's a Select call with a trivial projection lambda Function(i) i. If the original item is returned, what's the value of this? The answer is to make a query really a query, i.e. to 'defeat' degenerate queries. If a trivial query like "From i In ints Select i" would be optimized to just "ints" (because the projection doesn't do anything), the source of the query is revealed to the consumer of the query, something which has to be avoided in order to have a clean design (and which is more important when talking about LINQ to SQL kind of queries too).

Quite funny isn't it, to see the order of the methods calls being reversed due to the use of extension methods (tip: people who saw a decent amount of lambda calculus and formal semantics can write down some nice formulas to generalize this idea). The signature of these extension methods is the following:

Function Where(Of T)(ByVal source As IEnumerable(Of T), ByVal predicate As Func(Of T, Boolean)) As IEnumerable(Of T)
Function Select(Of T,R)(ByVal source As IEnumerable(Of T), ByVal projection As Func(Of T, R)) As IEnumerable(Of R)

Both generic methods take in a source as well as a function delegate that transforms the input to something else: the predicate returns a Boolean (should the item be included or not?) and the projection returns an object of another type R (the result of projection T into something else, e.g. an anonymous type).

Evaluation starts with the Enumerable.Where call, which takes the source from its first parameter and iterates over it, evaluating the lambda expression "predicate" for each and every item in the source sequence. If that predicate evaluates to true, it returns the item in the result sequence, which also is of type IEnumerable(Of T). Unfortunately I can't give you sample code of how the Where method works since Visual Basic doesn't have a concept called "iterators" which lies at the basis of this mechanism and is a C# 2.0 (and higher) feature. What is enables is a thing called "lazy evaluation": only when you start iterating over the results of the method, data is fetched from the source. It's like having a pipeline that sucks data out of the source only when it's really needed:

image

 

In the picture above imaging you start iterating over the results of the Select (the end of the query in the sample). As you turn the Select wheel, the Where weel will start to spin around, getting data from the source sequence till a match is found. Then that match bubbles up all the way to the Select output. So, in the source sequence, the following will happen:

1, 2, 3, 4, 5

You start to iterate over Select, which comes down to calling MoveNext on the enumerator (the thing that drives the For Each loop). Which enumerator? The one returned by the Select call. Internally, the Select method sucks data from the Where method by calling the Where method's returned enumerator's MoveNext method. This on its turn triggers an iteration over the source sequence. We've just arrived at the bottom of the call stack, now let's make our way up again:

  • 1 is fetched from the source
    • Where evaluates the predicate and sees 1 doesn't match
  • 2 is fetched from the source
    • Where evaluates the predicate and sees 2 is a match, so it yields back the 2 to its caller
      • Select got 2 from its source, it executes the lambda on it and returns the result (the lambda "projection" being Function(i) i, thus returning i itself)
        • The caller of the iteration receives 2

It's until the caller asks for another iteration cylce that the whole thing is suspended. That's why we call this lazy loading. Assume you have a source sequence of 1 million items and you break the iteration after five cycles: you won't have loaded any more data from the source sequence than is required. This is different than doing simple collection manipulations, reading the entire source sequence and finding all elements that are even, then taking this result and doing the projection for each of those items. In such a case you'd end up with a maximum of 1 million items (all the odd numbers are filtered out), maybe 500,000 or so, but you just needed 5 in the final iteration. So you've just spilled lots of processor cycles and memory space to do more than you really needed to

Note: There are situations however where lazy loading is broken because all elements are needed to carry out some operations. Notable examples are sorting and grouping. Indeed, to sort a sequence of items that have been ordered at random, all items have to be fetched in order to carry out a sort, before the result sequence can be returned to the caller.

So far so good for in-memory queries. What about other queries, like LINQ to SQL queries? Again, the mechanism is very similar. However, a query like the following:

Dim ctx As New NorthwindDataContext()
Dim res = From p In ctx.Products Where p.UnitPrice > 100 Select p.ProductName
For Each p In res
    Console.WriteLine(p)
Next

The first stage of the compilation is the same as previously:

Dim res = ctx.Products.Where(Function(p) p.UnitPrice > 100).Select(Function(p) p.ProductName)

Now it becomes interesting. As you know, Where and Select are extension methods but how does the compiler find the appropriate one? The answer is simple: by looking at the variable on which the method is called. In this case ctx.Products is of type Table(Of T), a class defined in the System.Data.Linq namespace. Table(Of T) doesn't have an instance method called Where, so the compiler starts to look for extension methods, a search based on the type of the object. There are no extension methods for Table(Of T) in scope but since Table(Of T) implements IQueryable(Of T), a class from the System.Linq namespace, the compiler looks for extension methods on IQueryable(Of T) and finds that one is available now (hooray). The signature of this method (and the Select method, since Queryable.Where returns an IQueryable(Of T) again) is:

Function Where(Of T)(ByVal source As IQueryable(Of T), ByVal predicate As Expression(Of Func(Of T, Boolean))) As IQueryable(Of T)
Function Select(Of T,R)(ByVal source As IQueryable(Of T), ByVal projection As Expression(Of Func(Of T, R))) As IQueryable(Of R)

So, the query is now translated into:

Dim res = Queryable.Select(Queryable.Where(ctx.Products, Function(p) p.UnitPrice > 100), Function(p) p.ProductName)

If you read the method signatures carefully, you'll see that the type of res is now IQueryable(Of String), because p.ProductName (the result value of the method call chain produced by the original query) is a string. However, there's more. Notice the second parameters of both methods: these are no longer of type Func(Of ...) but of type Expression(Of Func(Of ...)), which - as we learned before in the episode on Expression Trees - trigger the compiler to generate expression trees for the lambdas. So,

Function(p) p.UnitPrice > 100
Function(p) p.ProductName

both become expression trees, passed in to the IQueryable(Of T) object. Now, when you trigger iteration over the query (results), like this:

For Each p In res
    Console.WriteLine(p)
Next

or indirectly, e.g. by doing databinding sort of operations, you'll cause the IQueryable(Of T) implementation to parse the expression trees and transform these into the appropriate target language, in this case SQL:

SELECT p.ProductName FROM Products WHERE p.UnitPrice > 100

Now you know the basic principles, it's time to take a look at the various query operators that are supported by LINQ (yes, there's more than just Wheres and Selects):

From indicates the source of a query and has the following syntax:

From <dummy> [As <type>] In <source>

    • dummy is a variable name for use throughout the query to denote an item being processes (e.g. in a predicate or in a projection);
    • type can be used to indicate the type of the dummy, i.e. the type of an item in the source sequence; it's not mandatory if the type of the item can be inferred (e.g. when using generic collections IEnumerable(Of T) or IQueryable(Of T));
    • source obviously references the query source, e.g. an in-memory collection or an IQueryable(Of T) object.

Note: More than one From clause can be used in a query.

Select defines a projection clause in a query:

Select <projection>

    • projection can be any expression that returns one object; it could be a dummy from a From clause or something more complex, using the syntax below:

[<alias> = ] <column>[,...]

    • alias is optional and indicates the name of the property that will be generated on the anonymous type representing the projection;
    • column refers to a property or field on a dummy but could be a more complex expression too (in which case an alias is required since no name can be inferred automatically).

Examples:

  • Dim res = From i In New Integer() {1, 2, 3, 4, 5, 6} Select i 'returns 1,2,3,4,5,6
  • Dim res = From i In New Integer() {1, 2} Select Double = 2 * i, Triple = 3 * i 'returns {Double = 2, Triple = 3}, {Double = 4, Triple = 6}

Where is used to filter results based on a predicate:

Where <predicate>

    • predicate is a Boolean-valued expression that can use every variable in scope, including the dummies defined in From clauses.

Example: Dim res = From i In New Integer() {1, 2, 3, 4, 5, 6} Where i Mod 2 = 0 Select 'returns 2,4,6

Order By clauses put an ordering on the query's results:

Order By <ordering> [Ascending | Descending]

    • ordering is an expression used a the key in an ordering; if multiple orderings are specified in a row, subsequent orderings define secondary, ternary, ... and n-ary orderings (for the curious ones, primary orderings are translated into OrderBy and OrderByDescending calls, while n-ary orderings are translated into ThenBy and ThenByDescending calls).

Example:

  • Consider src contains { ID = 1, City = "Ghent", Country = "Belgium" }, { ID = 2, City = "Brussels", Country = "Belgium" }, { ID = 3, City = "Seattle", Country = "USA" }
  • Dim res = From c In src Order By c.Country Descending Order By c.City Select c.ID 'returns 3,2,1

Distinct filters out duplicates from a source sequence in a query:

Distinct

    • returns all distinct items from the input of the clause (i.e. "everything in front of" the Distinct clause)

Example: Dim res = From i In New Integer() {1, 2, 1, 3} Select Distinct 'returns 1,2,3

Skip and Take allow to select a range of items from a source sequence in a query:

Take(<number>)
Skip(<number>)

    • number is an integer value; e.g. Skip(10).Take(5) will select items 11 to 15 from the input sequence

Example: Dim res = From i In New Integer() {1, 2, 3, 4, 5, 6} Select Skip Take 'returns 4,5

Aggregate performs an aggregate, i.e. turning a sequence of items into a singleton value:

Aggregate <dummy> In <source> Into <aggregationList>

    • dummy is a dummy name for use in the aggregation;
    • source can be anything which is enumerable (including a nested subquery);
    • aggregationList is a comma-separated set of aggregation expressions that take the form:

      [<alias> =] <aggregation>[,...]

      If more than one aggregation expression is present, an anonymous type will be generated using the aliases used in the aggregation expression list. The aggregation supports various functions including Min, Max, Sum, Average, ...

Example: Dim res = Aggregate i In (From i In New Integer() {1, 2, 3, 4, 5, 6} Where i Mod 2 = 0 Select i) Into a = Average(), s = Sum() 'returns res.a = 4, res.s = 12

Group is used to make groupings based on a grouping expression; this is slightly more complicated than the query operations above:

Group [<fields>] By <keys> Into <aggregates>

    • fields is an optional comma-separated list and allows to specify the fields that have to be included in the grouped results (you could consider this to be a pre-grouping Select-alike projection);
    • keys is a comma-separated list of one or more name = value pairs, specifying the grouping keys;
    • aggregates is a comma-separated list of expressions that make up the group result and allow for continuation of the query.

A sample will clarify things so much:

Dim res = From s In New String() {"Bart", "Bob", "John"} Group By FirstLetter = s(0) Into Words = Group
For Each letter In res
    Console.WriteLine(letter.FirstLetter)
    For Each word In letter.Words
        Console.WriteLine("- " + word)
    Next
Next

Notice we don't have a fields part in here, but in most cases queries as the one above are complex enough for day-to-day purposes. In this sample, a set of words are grouped by the first letter (FirstLetter) of the word and the items in the group are called Words. Inside the loop, we can refer to FirstLetter as the grouping key and Words as the collection of the items in the corresponding group.

Join allows to join multiple sources together based on some join condition:

Join <dummy> In <source> On <key1> Equals <key2> [And ...]

    • dummy refers to items in the to-be-joined source;
    • source is the source of the items that have to be joined;
    • key1 and key2 are expressions to used in the join condition.

An example based on Northwind looks like this:

Dim res = From p In ctx.Products Join s In ctx.Suppliers On p.Supplier = s Select p.ProductName, s.SupplierName

Joins can be more complex if using "group joins" or when multiple join conditions are involved but let's keep it simple for now :-).

Happy coding!

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Comments

# VB 9.0 Feature Focus

Wednesday, September 05, 2007 6:50 PM by Nick's .NET Travels

In my previous post I commented that VB is coming of age in Visual Studio 2008 with better support for

# 23 Links Today (2007-09-07)

Friday, September 07, 2007 8:20 AM by 23 Links Today (2007-09-07)

Pingback from  23 Links Today (2007-09-07)

# New Features of Visual Basic 9 Article Series

Thursday, September 27, 2007 5:47 AM by Walter Stiers - Academic Relations Team (BeLux)

In a series of 15 posts, Bart De Smet explores several of the new features in Visual Basic 9 . These

# Visual Basic 9.0 Feature Focus - LINQ Queries

Monday, November 26, 2007 8:41 AM by Visual Basic 9.0 Feature Focus - LINQ Queries

Pingback from  Visual Basic 9.0 Feature Focus - LINQ Queries

# VB 9.0 Feature Focus – Link Collection

Saturday, August 09, 2008 7:16 AM by B# .NET Blog

Collecting a few of my posts for easy quick reference: Visual Basic 9.0 Feature Focus – Introduction

# How alias column name in LINQ? | keyongtech

Sunday, January 18, 2009 8:35 AM by How alias column name in LINQ? | keyongtech

Pingback from  How alias column name in LINQ? | keyongtech