Sunday 5 December 2010

Using Linq to remove duplicate data from a DataTable

In response to a question on experts exchange I just wanted to do a quick post on data cleanup using Linq

Linq has a lot of very useful operations that can be used to reduce the complexity of array (or dataset) operation.

In this instance the operator .Distinct() can be used to remove duplicates in a datatable, with a small priviso:

If you call the method on the datatable (using the call .AsEnumerable() to get it in to a form you can work with in linq) with no parameters you will find that it returns all rows - this is because by default it does an object comparison, and of course all the records have their own memory address, and therefore are considered to be different. The trick is to tell it to use the default DataRowComparer which will get it to do a column by column value check.

One final point to watch out for is in getting the data back in to datatable - remember that datasets/tables/rows are reference types, so clearing down the tabe will actually cause problems, so the trick is to take a copy of the datatable, clear the original, operate on the copy, then reimport the rows to the cleared down table.

All put together,  the following function will handle this for you:

/// <summary>/// Removes duplicate rows from the target table/// </summary>/// <param name="sourceDataSet">Dataset containing the table to operate on</param>/// <param name="targetTableName">Name of the table to operate on</param>
public void GetUniqueRows(DataSet sourceDataSet, String targetTableName)  //Check there is data to operate on

   //Take a copy of the table so clearing does not affect the operation
   //Run the distinct operation
   //Clear all data from the target table  sourceDataSet.Tables[targetTableName].Clear();

  //Re-import the distinct rows to the 'live' table
  foreach (DataRow DistinctRow in OutputRows)
  DataRow[] OutputRows = (from rows in OperatingData.AsEnumerable() select rows).Distinct(System.Data.DataRowComparer.Default).ToArray() as DataRow[];
  DataTable OperatingData = sourceDataSet.Tables[targetTableName].Copy();
  if (sourceDataSet.Tables[targetTableName].Rows.Count == 0)    return;

No comments:

Post a Comment