Friday, July 11, 2008

How to remove duplicate primary keys rows from DataTable

Let's say that I have a table in database, for some reason the one who created it didn't set any primary keys, but giving assumption that some columns will be use as primary keys. Because of that condition, somehow the columns that assume to be primary keys can have duplicate value and also can have null value. I want to remove records that have null and duplicate values in primary keys. There are two ways to do it, one is using query another one is loop DataRow and check it programmatically.

This is the first way
  • This is the first way
  • SELECT t.*
  • FROM
  • ( SELECT u.PK1 ,u.PK2 FROM MyTable u
  • WHERE u.PK1 is not null and u.PK2 is not null
  • GROUP BY u.PK1 ,u.PK2
  • HAVING COUNT(*) = 1 ) s
  • LEFT JOIN MyTable t
  • ON t.PK1=s.PK1 and t.PK1=s.PK2
  • WHERE t.PK1 is not null and t.PK2 is not null

You can use inner join too, but in this scenario it is faster to use left join because it doesn't have to check every records for each row. Thanks to Anita for helping me with this query, being inspired too by this article written by Pinal Dave. By the way this query is not the only way, u can use another trick or even use store procedure.

This is the second way, the important thing is YOU MUST SORT the content in dtSource and dtDup base on the primary keys respectively. I noticed that there are different behaviors of DataTable if u create it manually and if u use DataSet then populate the records from database. If u use DataTable that u create manually when u delete the row, then the row will be deleted permanently and the index will changed, but if u use DataSet then it will be marked as deleted it will be deleted if u use DataTable.AcceptChanges. If u use DataSet then look at my code with comment "//THIS LINE SHOULD BE REMOVED", statement "i--;" in that line should be removed, but if you use DataTable that u create manually u need that statement, it took a while for me to test and realize there are something odd with the result before, now it fix :) .
  1. class Program
  2. {
  3. static void Main(string[] args)
  4. {
  5. //DataTable that contain all records
  6. DataTable dtSource = new DataTable();
  7. dtSource.Columns.Add("PK1");
  8. dtSource.Columns.Add("PK2");
  9. dtSource.Columns.Add("Column1");
  10. dtSource.Columns.Add("Column2");
  11. Random rnd = new Random();
  12. int j = 0;
  13. ArrayList arr = new ArrayList();
  14. for (int i = 0; i < 10 ; i ++ )
  15. {
  16. j = rnd.Next(1, 3);
  17. if (j == 2) arr.Add(i);
  18. for (int k = 0; k < j ; k ++ )
  19. {
  20. DataRow dr = dtSource.NewRow();
  21. dr["PK1"] = "PK1" + i;
  22. dr["PK2"] = "PK2" + i;
  23. dr["Column1"] = "C1" + i;
  24. dr["Column2"] = "C2" + i;
  25. dtSource.Rows.Add(dr);
  26. }
  27. }
  28. //DataTable that only contains rows with the duplicate primary keys
  29. DataTable dtDup = new DataTable();
  30. dtDup.Columns.Add("PK1");
  31. dtDup.Columns.Add("PK2");
  32. for (int i = 0; i < arr.Count ; i ++ )
  33. {
  34. DataRow dr = dtDup.NewRow();
  35. dr["PK1"] = "PK1" + arr[i].ToString();
  36. dr["PK2"] = "PK2" + arr[i].ToString();
  37. dtDup.Rows.Add(dr);
  38. }
  39. PrintRows(dtSource);
  40. Console.WriteLine();
  41. Program obj = new Program();
  42. string [] dupPrimary = new string[2]{"PK1", "PK2"};
  43. obj.RemoveDuplicatePrimaryKey( ref dtSource, dtDup, dupPrimary );
  44. PrintRows(dtSource);
  45. Console.ReadLine();
  46. }
  47. private static void PrintRows(DataTable dt)
  48. {
  49. for (int i = 0; i < dt.Rows.Count; i++)
  50. Console.WriteLine("PK1: {0}, PK2: {1}, Column1: {2}, Column2: {3}",
  51. dt.Rows[i]["PK1"], dt.Rows[i]["PK2"],
  52. dt.Rows[i]["Column1"], dt.Rows[i]["Column2"]);
  53. }
  54. public void RemoveDuplicatePrimaryKey
  55. (ref DataTable Source, DataTable Target, string[] PrimaryKeys)
  56. {
  57. DataColumn[] keyColumns = new DataColumn[PrimaryKeys.Length];
  58. for (int i = 0; i < PrimaryKeys.Length; i++)
  59. {
  60. keyColumns[i] = Source.Columns[ PrimaryKeys[i] ];
  61. }
  62. bool bStart = false, bEnd = false;
  63. int j = 0;
  64. bool bDuplicate;
  65. for (int i = 0; i < Source.Rows.Count; i++)
  66. {
  67. DataRow drSource = Source.Rows[i];
  68. bDuplicate = IsDuplicate(drSource, Target.Rows[j], keyColumns);
  69. if (!bDuplicate)
  70. {
  71. if (bEnd) break;
  72. else if (bStart)
  73. {
  74. bStart = false;
  75. j++;
  76. i--;
  77. }
  78. else if (j == 0) continue;
  79. }
  80. else
  81. {
  82. Source.Rows[i].Delete();
  83. bStart = true;
  84. if (j == Target.Rows.Count - 1) bEnd = true;
  85. i--; //THIS LINE SHOULD BE REMOVED
  86. }
  87. }
  88. Source.AcceptChanges();
  89. }
  90. private bool IsDuplicate
  91. (DataRow drSource, DataRow drTarget, DataColumn[] dcPrimaryKeys)
  92. {
  93. bool retVal = true;
  94. foreach (DataColumn column in dcPrimaryKeys)
  95. {
  96. retVal = retVal &&
  97. (drSource[column.ColumnName].ToString() ==
  98. drTarget[column.ColumnName].ToString());
  99. if (!retVal) break;
  100. }
  101. return retVal;
  102. }
  103. }

In the real case, I use query to populate data into dtSource and dtDup, here they are:
  • SELECT PK1, PK2, Column1, Column2
  • FROM MyTable
  • WHERE PK1 is not null and PK2 is not null
  • ORDER BY PK1, PK2
  • SELECT PK1, PK2
  • FROM MyTable
  • WHERE PK1 is not null and PK2 is not null
  • GROUP BY PK1, PK2
  • HAVING COUNT(*) = 1
  • ORDER BY PK1, PK2

No comments: