Thursday, November 19, 2009

The Regex & StringTokenizer Challenge!

Sometimes programming is difficult: you have to make large architectural decisions with far-reaching implications for maintainability, performance and security. Other times, programming is simply frustrating: something 'simple' doesn't work the way you expected, you start seriously doubting your logic, and finally you bring out pen and paper for some good old pseudocode!

The problem...
I was doing some pro bono processing work on an Excel file, removing duplicate rows based on some complicated criteria. Preferring to keep my sanity, I abandoned VBA and exported the spreadsheet to CSV. The catch is that some of the cells were blank, and this produced unpredictable results in my code. Here's a sample of the data:


  1. ,,dsafdf,,15,,,,
  2. ,,fdsfjladsjf,,13,,,,
  3. df,,sdff,,bemail,1,1,,
  4. dsf,,sffdsf,,bsgemail,1,1,,
  5. token1,token2,token3,token4,token5,token6,token7,token8,token9


I've added the fifth row to demonstrate the number of tokens. In a complete row we expect 9 tokens, and their position is important!

Source Code
I've included some source code showing 3 different approaches:

    import java.util.StringTokenizer;

    public class Processor {

        public static void main(String[] args) {

            final String _DELIM = ",";

            // An array of the strings
            String[] sourceStrings = {
                ",,dsafdf,,15,,,,",
                ",,fdsfjladsjf,,13,,,,",
                "df,,sdff,,bemail,1,1,,",
                "dsf,,sffdsf,,bsgemail,1,1,,",
                "token1,token2,token3,token4,token5,token6,token7,token8,token9"
            };

            System.out.println("Approach One: String.split()");
            for (String s : sourceStrings) {
                // Do the processing on the strings using
                // the delimiter as a regular expression
                String[] split = s.split(_DELIM);

                System.out.println("Number of tokens : " + split.length);
            }

            // Leading line break so the output matches the listing below
            System.out.println("\r\nApproach Two: StringTokenizer");
            for (String s : sourceStrings) {
                // false: delimiters are not returned as tokens
                StringTokenizer st = new StringTokenizer(s, _DELIM, false);

                System.out.println("Number of tokens : " + st.countTokens());
            }

            System.out.println("\r\nApproach Three: StringTokenizer returning delimiter - Take One");
            for (String s : sourceStrings) {
                // true: each delimiter is returned as a token of its own
                StringTokenizer st = new StringTokenizer(s, _DELIM, true);

                System.out.println("Number of tokens : " + st.countTokens());
            }
        }
    }

Output
A listing of the output of the code above is shown below:

    Approach One: String.split()
    Number of tokens : 5
    Number of tokens : 5
    Number of tokens : 7
    Number of tokens : 7
    Number of tokens : 9

    Approach Two: StringTokenizer
    Number of tokens : 2
    Number of tokens : 2
    Number of tokens : 5
    Number of tokens : 5
    Number of tokens : 9

    Approach Three: StringTokenizer returning delimiter - Take One
    Number of tokens : 10
    Number of tokens : 10
    Number of tokens : 13
    Number of tokens : 13
    Number of tokens : 17


Approach One: String.split()
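The first approach uses the String.split() method, passing in the delimiter as a regular expression. As can be seen, this fails the test: the counts range from 5 to 9, when the correct answer is always 9. The single-argument split() discards trailing empty strings, so the blank cells at the end of a row are dropped.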
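For comparison, String.split() also has a two-argument overload, and a negative limit tells it to keep trailing empty strings. The snippet below is only a minimal sketch of that idea, not the solution from this post: the class name SplitWithLimit and the helper fields() are made up for illustration, and Pattern.quote() is used on the assumption that the delimiter should be matched literally rather than as a regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class SplitWithLimit {

    // Splits a line on a literal delimiter, keeping every empty field.
    static String[] fields(String line, String delim) {
        // A negative limit keeps trailing empty strings in the result.
        return line.split(Pattern.quote(delim), -1);
    }

    public static void main(String[] args) {
        String[] rows = {
            ",,dsafdf,,15,,,,",
            "df,,sdff,,bemail,1,1,,",
            "token1,token2,token3,token4,token5,token6,token7,token8,token9"
        };
        for (String row : rows) {
            // Prints 9 for each of these rows
            System.out.println("Number of tokens : " + fields(row, ",").length);
        }
    }
}
```

Blank cells come back as empty strings rather than null, so positions are preserved and a check such as isEmpty() can flag the missing values.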

Approach Two: StringTokenizer
The second approach uses StringTokenizer to split up the strings. Again, the results are wrong: we get only 2 tokens for the first two lines, because it returns only the items that are not the delimiter and treats a run of consecutive delimiters as a single separator. The correct answer here would be 9, with a null (or empty) token for each blank cell.
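To make the position problem concrete, here is a small sketch that lists the tokens StringTokenizer actually returns for the third sample row. The class name CollapsedTokens and the helper tokens() are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original post.
class CollapsedTokens {

    // Collects every token the plain StringTokenizer hands back.
    static List<String> tokens(String line, String delim) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delim);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> t = tokens("df,,sdff,,bemail,1,1,,", ",");
        System.out.println(t);                   // [df, sdff, bemail, 1, 1]
        System.out.println(t.indexOf("bemail")); // 2, although it sits in the fifth column (index 4)
    }
}
```

Because the blank cells vanish, every value after the first gap shifts left, which is why the column positions can no longer be trusted.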

Approach Three: StringTokenizer returning the delimiter
The third approach uses a different StringTokenizer constructor that also returns the delimiters as tokens. Again, lots of wrong results. The correct number of tokens should be 17: 8 commas and 9 non-comma items, with the blank ones preferably null.
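The delimiter-returning constructor does carry enough information to rebuild the blank cells: two delimiters in a row, or a delimiter at the start or end of the line, mark an empty field. The following is a rough sketch of that bookkeeping, assuming a single-character delimiter; the class DelimiterAwareSplit and its fields() method are invented names, and it is not the approach the update further down takes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the original post.
class DelimiterAwareSplit {

    // Rebuilds all fields, including empty ones, from a delimiter-returning tokenizer.
    // Assumes delim is a single character.
    static List<String> fields(String line, String delim) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delim, true);
        // The start of the line behaves as if a delimiter came just before it.
        boolean lastWasDelim = true;
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(delim)) {
                if (lastWasDelim) {
                    out.add(""); // nothing between two delimiters: empty field
                }
                lastWasDelim = true;
            } else {
                out.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            out.add(""); // the line ended on a delimiter: trailing empty field
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields(",,dsafdf,,15,,,,", ",").size());       // 9
        System.out.println(fields("df,,sdff,,bemail,1,1,,", ",").get(4)); // bemail
    }
}
```

Here blank cells are represented as empty strings; swapping in null instead would be a one-line change in the two places that add "".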

So there you have it, a rather simple task (I would have thought), but I did bang my head against the wall trying to find the right solution. If I ever get to be a teacher, this would be great for an exam! There's the challenge! I'll post the solution soon.


UPDATE!

    public class NonCollapsingStringTokenizer {

        private String str;
        private String delim;
        private int currentPosition;

        // Only used by clone()
        private NonCollapsingStringTokenizer() {
        }

        public NonCollapsingStringTokenizer(String str, String delimStr) {
            this.str = str;
            this.delim = delimStr;
        }

        public String nextToken() {
            // Default to the end of the string if no further delimiter exists
            int nextDelimPosition = str.length();
            int delimPosition = str.indexOf(delim, currentPosition);
            if (delimPosition >= 0 && delimPosition < nextDelimPosition) {
                nextDelimPosition = delimPosition;
            }
            String token = str.substring(currentPosition, nextDelimPosition);
            // Skip the whole delimiter, which may be longer than one character
            currentPosition = nextDelimPosition + delim.length();
            return token;
        }

        public boolean hasMoreTokens() {
            return (currentPosition < str.length());
        }

        public int countTokens() {
            int count = 0;

            // Count on a copy so this tokenizer's position is left untouched
            NonCollapsingStringTokenizer clone = (NonCollapsingStringTokenizer) this.clone();
            while (clone.hasMoreTokens()) {
                clone.nextToken();
                count++;
            }

            return count;
        }

        public Object clone() {
            NonCollapsingStringTokenizer copy = new NonCollapsingStringTokenizer();
            copy.str = str;
            copy.delim = delim;
            copy.currentPosition = currentPosition;
            return copy;
        }
    }

The class above keeps the empty tokens between delimiters instead of collapsing them. This can be tested with the code below:

    public static void main(String[] args) {
        String s = ",,dsafdf,,15,,,,";

        final String _DELIM = ",";
        //s = "token1,token2,token3,token4,token5,token6,token7,token8,token9";
        s = ",,sffdsf,,bsgemail,1,1,5,dsf";

        NonCollapsingStringTokenizer ncst = new NonCollapsingStringTokenizer(s, _DELIM);
        int x = 1;
        while (ncst.hasMoreTokens()) {
            String nextToken = ncst.nextToken();

            System.out.println(Integer.toString(x++) + " " + nextToken);
        }
    }

This produces the listing below:

    1 
    2 
    3 sffdsf
    4 
    5 bsgemail
    6 1
    7 1
    8 5
    9 dsf

1 comment:

Anonymous said...

I see you lifted this code from Code Ranch. It has the same problem as the Code Ranch code: if your last column is empty, it doesn't actually register that as another token. This is because the "hasMoreTokens()" method only checks whether we have reached the end of the string, not whether the last part of the string is a delimiter. This is the fixed method:

    public boolean hasMoreTokens() {
        if (currentPosition < str.length())
            return true;
        else if (currentPosition == str.length() &&
                str.substring(str.length() - delim.length()).equals(delim))
            return true;
        else
            return false;
    }
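
As a rough check of the fix described in this comment, the sketch below folds that hasMoreTokens() logic into a stripped-down tokenizer and counts the tokens in the sample rows from the post. The class name TrailingEmptyTokenizer and the count() helper are made up for illustration, and str.endsWith(delim) is swapped in for the substring comparison on the assumption that the two are equivalent whenever the string is at least as long as the delimiter.

```java
// Hypothetical class for illustration; not the code from the post or from Code Ranch.
class TrailingEmptyTokenizer {

    private final String str;
    private final String delim;
    private int currentPosition;

    TrailingEmptyTokenizer(String str, String delim) {
        this.str = str;
        this.delim = delim;
    }

    String nextToken() {
        int end = str.indexOf(delim, currentPosition);
        if (end < 0) {
            end = str.length(); // no further delimiter: token runs to the end
        }
        String token = str.substring(currentPosition, end);
        currentPosition = end + delim.length();
        return token;
    }

    boolean hasMoreTokens() {
        if (currentPosition < str.length()) {
            return true;
        }
        // Sitting exactly at the end after a trailing delimiter:
        // one more (empty) token is still owed.
        return currentPosition == str.length() && str.endsWith(delim);
    }

    // Counts tokens with a fresh tokenizer so no state is shared.
    static int count(String line, String delim) {
        TrailingEmptyTokenizer t = new TrailingEmptyTokenizer(line, delim);
        int n = 0;
        while (t.hasMoreTokens()) {
            t.nextToken();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(",,dsafdf,,15,,,,", ","));       // 9
        System.out.println(count("df,,sdff,,bemail,1,1,,", ",")); // 9
        System.out.println(count("token1,token2,token3,token4,token5,token6,token7,token8,token9", ",")); // 9
    }
}
```

One caveat with this sketch: a completely empty line yields zero tokens rather than a single empty one, which may or may not be what a CSV reader wants.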