Affects PMD Version: 6.x
Description:
Some languages like PL/SQL or the new T-SQL (#4390) are case-insensitive. Tokenizing works correctly for them, i.e. the lexers are agnostic to casing. JavaCC has a grammar option for this (IGNORE_CASE), and ANTLR has had one as well since 4.10 (caseInsensitive).
However, when we convert the original tokens into CPD TokenEntries, we don't seem to use the token kind. Instead we use the original token text, which retains the original casing. It's therefore very easy to work around duplicate detection for these languages by just changing the casing:
echo 'select a, b, c, d, e, f from table where x = 1 and y = 2;' > file1.plsql
cp file1.plsql file2.plsql
echo 'sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;' > file3.plsql
run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file2.plsql
results correctly in:
Found a 1 line (23 tokens) duplication in the following files:
Starting at line 1 of /home/andreas/temp/plsql/file1.plsql
Starting at line 1 of /home/andreas/temp/plsql/file2.plsql
select a, b, c, d, e, f from table where x = 1 and y = 2;
since file1.plsql and file2.plsql are identical.
However, comparing file1.plsql and file3.plsql, which differ only in casing, shows no duplications:
run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file3.plsql
I think this problem affects both JavaCC- and ANTLR-based languages.
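To illustrate the idea, here is a minimal, self-contained sketch of what normalizing the token image before comparison could look like. It does not use PMD's actual classes: `CaseInsensitiveTokenSketch`, `normalizeImage`, and `tokenImages` are hypothetical names, and the whitespace splitter is only a stand-in for a real lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CaseInsensitiveTokenSketch {

    // Hypothetical helper (not part of PMD's API): returns the text that
    // would be stored for duplicate comparison. For case-insensitive
    // languages the image is lower-cased with a fixed locale.
    static String normalizeImage(String image, boolean caseSensitive) {
        return caseSensitive ? image : image.toLowerCase(Locale.ROOT);
    }

    // Naive whitespace tokenizer standing in for a real JavaCC/ANTLR lexer.
    static List<String> tokenImages(String source, boolean caseSensitive) {
        List<String> images = new ArrayList<>();
        for (String raw : source.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                images.add(normalizeImage(raw, caseSensitive));
            }
        }
        return images;
    }

    public static void main(String[] args) {
        String file1 = "select a, b, c, d, e, f from table where x = 1 and y = 2;";
        String file3 = "sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;";

        // Comparing the original text: the two token sequences differ.
        System.out.println(tokenImages(file1, true).equals(tokenImages(file3, true)));
        // Comparing normalized images: the two token sequences match.
        System.out.println(tokenImages(file1, false).equals(tokenImages(file3, false)));
    }
}
```

A real fix would probably need to be more selective than this, for example by leaving string literals untouched, since their contents stay case-sensitive even in case-insensitive languages.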