How to add a new language module with CPD support.
Adding support for a CPD language
CPD works generically on the tokens produced by a CpdLexer. To add support for a new language, the crucial piece is writing a CpdLexer that splits the source file into the tokens specific to your language. Thankfully you can use a stock Antlr grammar or JavaCC grammar to generate a lexer for you. If you cannot use a lexer generator, for instance because you are wrapping a lexer from another library, it is still relatively easy to implement the CpdLexer interface.
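To see why the lexer is the crucial piece, here is a minimal sketch in plain Java of duplicate detection that works only on token sequences. It is not PMD's algorithm or API: the class `TokenDuplicateSketch`, its whitespace-based `tokenize` method, the `sharedRuns` helper, and the `minTokens` parameter are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in PMD.
public class TokenDuplicateSketch {

    // Stand-in for a language-specific lexer: splits the source on whitespace.
    static List<String> tokenize(String source) {
        String trimmed = source.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Returns every run of minTokens consecutive tokens that occurs in both lists.
    static Set<List<String>> sharedRuns(List<String> a, List<String> b, int minTokens) {
        Set<List<String>> runsInA = new HashSet<>();
        for (int i = 0; i + minTokens <= a.size(); i++) {
            runsInA.add(new ArrayList<>(a.subList(i, i + minTokens)));
        }
        Set<List<String>> shared = new HashSet<>();
        for (int i = 0; i + minTokens <= b.size(); i++) {
            List<String> run = new ArrayList<>(b.subList(i, i + minTokens));
            if (runsInA.contains(run)) {
                shared.add(run);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> first = tokenize("x := a + b ; print ( x )");
        List<String> second = tokenize("y := 1 ;   print ( x )");
        // Differences in whitespace disappear once the text is tokenized.
        System.out.println(sharedRuns(first, second, 4));
    }
}
```

Because the comparison only ever sees tokens, the quality of the duplicates it can find depends entirely on how the lexer splits the source.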
Use the following guide to set up a new language module that supports CPD.
Create a new Maven module for your language and add it to the parent pom as a <module> entry, so that it is built alongside the other languages.
Implement a CpdLexer.
For Antlr grammars you can take the grammar from antlr/grammars-v4 and place it in src/main/antlr4 followed by the package name of the language. You then need to call the antlr4 plugin and the appropriate ant wrapper with target cpd-language to generate the lexer from the grammar. To do so, edit pom.xml (e.g. like the Golang module). Once that is done, mvn generate-sources should generate the lexer sources for you.
You can now implement a CpdLexer, for instance by extending AntlrCpdLexer. The following reproduces the Go implementation:
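// mind the package convention if you are going to make a PR
package net.sourceforge.pmd.lang.go.cpd;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Lexer;

import net.sourceforge.pmd.cpd.impl.AntlrCpdLexer;
import net.sourceforge.pmd.lang.go.ast.GolangLexer;

public class GoCpdLexer extends AntlrCpdLexer {
    @Override
    protected Lexer getLexerForSource(CharStream charStream) {
        return new GolangLexer(charStream);
    }
}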
If your language is case-insensitive, you may want to override getImage(AntlrToken). There you can change each token, e.g. into uppercase, so that CPD sees the same strings and can find duplicates even when the casing differs. See TSqlCpdLexer for an example. You will also need a "CaseChangingCharStream", so that antlr itself is case-insensitive.
For JavaCC grammars, place your grammar in etc/grammar and edit the pom.xml like the Python implementation does. You can then subclass JavaccCpdLexer instead of AntlrCpdLexer.
If your JavaCC-based language is case-insensitive (option IGNORE_CASE=true), then you need to implement JavaccTokenDocument.TokenDocumentBehavior, which can change each token, e.g. into uppercase. See PLSQLParser for an example.
Create a Language implementation, and make it implement CpdCapableLanguage. If your language only supports CPD, then you can subclass CpdOnlyLanguageModuleBase to get going:
// mind the package convention if you are going to make a PR
package net.sourceforge.pmd.lang.go;

import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.go.cpd.GoCpdLexer;
import net.sourceforge.pmd.lang.impl.CpdOnlyLanguageModuleBase;

public class GoLanguageModule extends CpdOnlyLanguageModuleBase {
    // A public noarg constructor is required.
    public GoLanguageModule() {
        super(LanguageMetadata.withId("go").name("Go").extensions("go"));
    }

    @Override
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle) {
        // This method should return an instance of the CpdLexer you created.
        return new GoCpdLexer();
    }
}
To make PMD find the language module at runtime, write the fully-qualified name of your language class into the file src/main/resources/META-INF/services/net.sourceforge.pmd.lang.Language.
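The META-INF/services file follows the standard java.util.ServiceLoader convention: the file is named after the service interface and lists the implementation classes. The following self-contained sketch shows that mechanism in isolation, under the assumption that a stand-in interface is enough to illustrate it. `ServiceFileSketch`, `Lang`, and `GoLang` are hypothetical names, not PMD classes, and the services file is written to a temporary directory instead of src/main/resources.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical illustration only; Lang and GoLang stand in for a service
// interface and one of its implementations.
public class ServiceFileSketch {

    public interface Lang {
        String id();
    }

    // Providers must be public and have a public no-arg constructor.
    public static class GoLang implements Lang {
        public GoLang() { }

        @Override
        public String id() {
            return "go";
        }
    }

    // Writes a META-INF/services file into a temporary directory and lets
    // ServiceLoader discover the provider named in it.
    static List<String> discoverIds() {
        try {
            Path root = Files.createTempDirectory("services-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // File name: fully-qualified name of the service interface.
            // File content: fully-qualified name of the implementation.
            Files.write(services.resolve(Lang.class.getName()),
                        GoLang.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> ids = new ArrayList<>();
            URL[] urls = {root.toFile().toURI().toURL()};
            try (URLClassLoader loader = new URLClassLoader(urls, ServiceFileSketch.class.getClassLoader())) {
                for (Lang lang : ServiceLoader.load(Lang.class, loader)) {
                    ids.add(lang.id());
                }
            }
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discoverIds());
    }
}
```

In a real language module the file is a static resource, so no code like `discoverIds` is needed; the sketch only shows why both the file name and the public no-arg constructor matter.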
At this point the new language module should be available in CPD and usable like any other language.
Update the test that asserts the list of supported languages by updating the SUPPORTED_LANGUAGES constant in BinaryDistributionIT.
Add some tests for your CpdLexer by following the section below.
Add a page in the documentation. Create a new markdown file <langId>.md in docs/pages/pmd/languages/. This file should have the following frontmatter:
---
title: <Language Name>
permalink: pmd_languages_<langId>.html
last_updated: <Month> <Year> (<PMD Version>)
tags: [languages, CpdCapableLanguage]
---
On this page, language specifics can be documented, e.g. when the language was first supported by PMD. There is also the following Jekyll include, which creates a summary box for the language:
{% include language_info.html name='<Language Name>' id='<langId>' implementation='<langId>::lang.<langId>.<langId>LanguageModule' supports_cpd=true %}
To make the CpdLexer configurable, first define some property descriptors using PropertyFactory. Look at CpdLanguageProperties for some predefined ones which you can reuse (prefer reusing property descriptors if you can). You need to override newPropertyBundle and call definePropertyDescriptor to register the descriptors. After that you can access the values of the properties from the parameter of createCpdLexer.
To implement simple token filtering, you can use BaseTokenFilter as a base class, or another base class in net.sourceforge.pmd.cpd.impl. Take a look at the Kotlin token filter implementation, or the Java one.
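As a rough illustration of what a token filter does, the following plain-Java sketch wraps a token stream and discards some tokens before they reach the consumer. It does not use or reproduce the BaseTokenFilter API: `ImportSkippingFilterSketch` and its rule (drop everything from an `import` token up to the next `;`) are hypothetical and chosen only to show the wrapping pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; this is not PMD's BaseTokenFilter API.
public class ImportSkippingFilterSketch implements Iterator<String> {
    private final Iterator<String> source;
    private String next;

    public ImportSkippingFilterSketch(Iterator<String> source) {
        this.source = source;
        advance();
    }

    // Pulls tokens from the underlying stream, discarding everything from
    // an "import" token up to and including the terminating ";".
    private void advance() {
        next = null;
        boolean discarding = false;
        while (source.hasNext()) {
            String token = source.next();
            if (discarding) {
                if (token.equals(";")) {
                    discarding = false;
                }
                continue;
            }
            if (token.equals("import")) {
                discarding = true;
                continue;
            }
            next = token;
            return;
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }

    // Convenience helper: applies the filter to a whole token list.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        new ImportSkippingFilterSketch(tokens.iterator()).forEachRemaining(kept::add);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("import", "a", ".", "b", ";", "class", "C", "{", "}")));
    }
}
```

The filter sits between the lexer and the duplicate detection, so the lexer itself can stay a thin wrapper around the generated grammar.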
Add a Maven dependency on pmd-lang-test (scope test) in your pom.xml. This contains utilities to test your CpdLexer.
Create a test class extending from CpdTextComparisonTest. To add tests, you need to write regular JUnit @Test-annotated methods, and call the method doTest with the name of the test file.
For example, for the Dart language:
package net.sourceforge.pmd.lang.dart.cpd;

import org.junit.jupiter.api.Test; // assuming JUnit 5
// CpdTextComparisonTest is provided by pmd-lang-test and must be imported as well.

public class DartTokenizerTest extends CpdTextComparisonTest {

    /**********************************
      Implementation of the superclass
    ***********************************/

    public DartTokenizerTest() {
        super("dart", ".dart"); // the ID of the language, then the file extension used by test files
    }

    @Override
    protected String getResourcePrefix() {
        // "testdata" is the default value, you don't need to override.
        // This specifies that you should place the test files in
        // src/test/resources/net/sourceforge/pmd/lang/dart/cpd/testdata
        return "testdata";
    }

    /**************
      Test methods
    ***************/

    @Test // don't forget the JUnit annotation
    public void testLiterals() {
        // This will look for a file named literals.dart
        // in the directory identified by getResourcePrefix,
        // tokenize it, then compare the result against a baseline
        // literals.txt file in the same directory
        // If the baseline file does not exist, it is created automatically
        doTest("literals");
    }
}
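The baseline comparison described in the comments of testLiterals can be sketched in plain Java as follows. This is not the code of CpdTextComparisonTest: `BaselineSketch`, `check`, `demo`, and the `Result` enum are hypothetical names, and how a freshly created baseline is reported is left open by returning a separate `CREATED` value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the implementation used by pmd-lang-test.
public class BaselineSketch {

    enum Result { CREATED, MATCH, MISMATCH }

    // Compares actual with the baseline file, creating the file when it is missing.
    static Result check(Path baseline, String actual) {
        try {
            if (Files.notExists(baseline)) {
                Files.write(baseline, actual.getBytes(StandardCharsets.UTF_8));
                return Result.CREATED;
            }
            String expected = new String(Files.readAllBytes(baseline), StandardCharsets.UTF_8);
            return expected.equals(actual) ? Result.MATCH : Result.MISMATCH;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs three checks against a baseline in a fresh temporary directory.
    static List<Result> demo() {
        try {
            Path baseline = Files.createTempDirectory("baseline-sketch").resolve("literals.txt");
            List<Result> results = new ArrayList<>();
            results.add(check(baseline, "var x ;")); // no baseline yet
            results.add(check(baseline, "var x ;")); // same output as the baseline
            results.add(check(baseline, "var y ;")); // output changed
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Since the baseline is generated from the lexer's own output, it is worth reviewing a newly created baseline file by hand before committing it.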