Add entry point for extracting datasets from TEI #4

Draft · wants to merge 47 commits into base: master
Changes shown from 1 commit (of 47 commits):
5e8ecfd
make gradle build and add github actions
lfoppiano Mar 29, 2024
d545b5d
read grobid-home from configuration
lfoppiano Mar 29, 2024
33648de
disable superfluous tests
lfoppiano Mar 29, 2024
49a07b6
fix build
lfoppiano Mar 29, 2024
5d2872e
add simple test on analyzer to get started
lfoppiano Mar 29, 2024
8bc2987
enable jacoco report
lfoppiano Mar 29, 2024
fd84d88
fix build docker
lfoppiano Mar 29, 2024
ffb5bea
disable docker build for the moment
lfoppiano Mar 29, 2024
bb48f37
add parameter to enable/disable sentence segmentation for TEI processing
lfoppiano Apr 18, 2024
f05f68b
Update docker build (#1)
lfoppiano Apr 26, 2024
981ac95
implement tei processing for datasets
lfoppiano Apr 26, 2024
d668625
fix output JSON streaming
lfoppiano Apr 26, 2024
33d4f13
Merge branch 'master' into add-tei-processing-dataset
lfoppiano May 1, 2024
288850f
add the rest of the processing
lfoppiano May 2, 2024
12dcc37
disable broken tests
lfoppiano May 2, 2024
23c2dd5
add XML JATS entry point
lfoppiano May 2, 2024
0213c78
add CC-BY sample documents
lfoppiano May 2, 2024
52ffc23
revert to the original port
lfoppiano May 2, 2024
4448437
enable TEI processing in UI - javascript joy
lfoppiano May 2, 2024
4aad23d
correct parameter
lfoppiano May 2, 2024
6989335
attach URLs obtained from Grobid's TEI
lfoppiano May 6, 2024
7f0cdd5
fix frontend
lfoppiano May 7, 2024
1c5ff72
fix github action
lfoppiano May 7, 2024
4cd7390
fix wrong ifs - thanks intellij!
lfoppiano May 9, 2024
df86b81
avoid exception when entities are empty
lfoppiano May 9, 2024
843463c
avoid injecting null stuff
lfoppiano May 9, 2024
1b1da5f
reduce the timeout for checking the disambiguation service
lfoppiano May 12, 2024
75dd711
fix the convention for sentence segmentation and enable it
lfoppiano May 20, 2024
758f418
update examples
lfoppiano May 21, 2024
91fe70d
add sequence (sentence, paragraph) identifier in each mention
lfoppiano May 21, 2024
cc1cd2a
Fix sentence switch
lfoppiano May 21, 2024
c58502e
Fix incorrect xpath on children
lfoppiano May 23, 2024
6977bda
Cleanup text when extracting from XML, normalise unicode character, r…
lfoppiano Jun 4, 2024
cc01140
Fix bug in the xpaths that were used wrongly to select sentences or p…
lfoppiano Jun 4, 2024
3c3af44
Try to get possible sections in the <back> in which the das is hidden…
lfoppiano Jun 4, 2024
7b6fe06
update to grobid 0.8.1, and catch up other changes
lfoppiano Sep 14, 2024
2162720
retrieve URLs from the TEI XML in all the sections that are of interest
lfoppiano Oct 13, 2024
a2b5bbb
update github actions
lfoppiano Oct 13, 2024
e3a4890
fix xpath to fall back into div into TEI/back
lfoppiano Oct 13, 2024
371f520
cleanup
lfoppiano Oct 13, 2024
1483aab
fix reference mapping
lfoppiano Oct 13, 2024
4ab67a6
fix references extraction
lfoppiano Oct 14, 2024
774dd78
fix regression
lfoppiano Oct 22, 2024
b18454b
cosmetics
lfoppiano Oct 22, 2024
962f7eb
fix regressions in the way we attach references from TEI
lfoppiano Oct 22, 2024
3b343c6
allow xml:id to be string using a wrapper that generates integer to m…
lfoppiano Jan 1, 2025
f58c493
fix extraction of urls that are not well formed (supplementary-materi…
lfoppiano Jan 2, 2025
allow xml:id to be string using a wrapper that generates integer to maintain the compatibility with the rest of the processing
lfoppiano committed Jan 1, 2025
commit 3b343c6fc9867a65df7ca19a2433961b154cc390
17 changes: 16 additions & 1 deletion Readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -200,7 +200,22 @@ curl --form input=@./src/test/resources/PMC1636350.pdf --form disambiguate=1 loc

For PDF, each entity will be associated with a list of bounding box coordinates relative to the PDF, see [here](https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/#coordinate-system-in-the-pdf) for more explanation about the coordinate system.

In addition, the response will contain the bibliographical reference information associated to a dataset mention when found. The bibliographical information are provided in XML TEI (similar format as GROBID).
In addition, the response will contain the bibliographical reference information associated to a dataset mention when found.
The bibliographical information are provided in XML TEI (similar format as GROBID).

#### /service/annotateDatasetTEI

This entry point consumes a TEI-XML file produced by Grobid or pub2tei.

| method | request type | response type | parameters | requirement | description |
|--------|-----------------------|--------------------|--------------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| POST | `multipart/form-data` | `application/json` | `input` | required | TEI file to be processed |
| | | | `segmentSentences` | optional | Indicate whether to apply sentence segmentation. If the TEI was segmented before (by Grobid, for example) this should be set to '0'. |

[//]: # (| | | | `disambiguate` | optional | `disambiguate` is a string of value `0` &#40;no disambiguation, default value&#41; or `1` &#40;disambiguate and inject Wikidata entity id and Wikipedia pageId&#41; |)


Using ```curl``` POST request with a __TEI-XML file__:
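The curl example is left blank at this point in the diff; a minimal sketch of what such a request could look like, mirroring the PDF example earlier in this README. The host, port (8060), and input file path are assumptions, not values confirmed by this PR — adjust them to your deployment:

```shell
# Sketch: POST a TEI-XML file to the annotateDatasetTEI endpoint.
# localhost:8060 and the sample path are placeholders -- adapt as needed.
curl --form input=@./src/test/resources/sample.tei.xml \
     --form segmentSentences=1 \
     localhost:8060/service/annotateDatasetTEI
```

Set `segmentSentences=0` if the input TEI was already sentence-segmented by Grobid.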


## Contact and License
Expand Down
2 changes: 1 addition & 1 deletion src/main/java/org/grobid/core/data/BiblioComponent.java
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ public class BiblioComponent extends DatasetComponent {
// the full matched bibliographical reference record
protected BiblioItem biblio = null;

// identifier for relating callout and reference, should be cconsistent with
// identifier for relating callout and reference, should be consistent with
// a full text TEI produced by GROBID
protected int refKey = -1;

Expand Down
49 changes: 49 additions & 0 deletions src/main/java/org/grobid/core/data/BiblioComponentWrapper.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
package org.grobid.core.data;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Maps string-valued reference keys (e.g. xml:id values) to generated integer
 * keys and back, so string identifiers stay compatible with processing that
 * expects integer refKeys.
 */
public class BiblioComponentWrapper {
private Map<String, Integer> stringToRefKeyMap;
private Map<Integer, String> refKeyToStringMap;
private AtomicInteger refKeyGenerator;

public BiblioComponentWrapper() {
stringToRefKeyMap = new HashMap<>();
refKeyToStringMap = new HashMap<>();
refKeyGenerator = new AtomicInteger(0);
}

public void addMapping(String refKeyString) {
if (!stringToRefKeyMap.containsKey(refKeyString)) {
int refKey = refKeyGenerator.incrementAndGet();
stringToRefKeyMap.put(refKeyString, refKey);
refKeyToStringMap.put(refKey, refKeyString);
}
}

public Integer getRefKey(String refKeyString) {
String refKeyStringClean = refKeyString.replaceFirst("^#", "");
addMapping(refKeyStringClean);
return stringToRefKeyMap.get(refKeyStringClean);
}

public String getRefKeyString(int refKey) {
return refKeyToStringMap.get(refKey);
}

public void removeMapping(String refKeyString) {
Integer refKey = stringToRefKeyMap.remove(refKeyString);
if (refKey != null) {
refKeyToStringMap.remove(refKey);
}
}

public void removeMapping(int refKey) {
String refKeyString = refKeyToStringMap.remove(refKey);
if (refKeyString != null) {
stringToRefKeyMap.remove(refKeyString);
}
}
}
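To illustrate the round-trip this wrapper provides, here is a self-contained sketch. The `Wrapper` class below is a condensed re-statement of the diff above (using `computeIfAbsent` instead of an explicit `containsKey` check), and the demo class and sample ids are purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Condensed re-statement of BiblioComponentWrapper from the diff above.
class Wrapper {
    private final Map<String, Integer> stringToRefKeyMap = new HashMap<>();
    private final Map<Integer, String> refKeyToStringMap = new HashMap<>();
    private final AtomicInteger refKeyGenerator = new AtomicInteger(0);

    public Integer getRefKey(String refKeyString) {
        // Strip a leading '#' (TEI callouts reference targets as "#id").
        String clean = refKeyString.replaceFirst("^#", "");
        return stringToRefKeyMap.computeIfAbsent(clean, k -> {
            int key = refKeyGenerator.incrementAndGet();
            refKeyToStringMap.put(key, k);
            return key;
        });
    }

    public String getRefKeyString(int refKey) {
        return refKeyToStringMap.get(refKey);
    }
}

public class WrapperDemo {
    public static void main(String[] args) {
        Wrapper w = new Wrapper();
        int k1 = w.getRefKey("#b12");    // leading '#' stripped before mapping
        int k2 = w.getRefKey("b12");     // same id with or without '#' -> same key
        int k3 = w.getRefKey("ref-abc"); // non-numeric xml:id still gets an int key
        if (k1 != k2) throw new AssertionError("same id must map to the same key");
        if (k1 == k3) throw new AssertionError("distinct ids must get distinct keys");
        if (!"b12".equals(w.getRefKeyString(k1))) throw new AssertionError("round-trip failed");
        System.out.println("b12 -> " + k1 + ", ref-abc -> " + k3); // prints: b12 -> 1, ref-abc -> 2
    }
}
```

This is what lets the commit "allow xml:id to be string" work: downstream code keeps consuming integer refKeys while the TEI may carry arbitrary string ids.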
32 changes: 24 additions & 8 deletions src/main/java/org/grobid/core/engines/DatasetParser.java
Original file line number Diff line number Diff line change
Expand Up @@ -962,7 +962,7 @@ public Pair<List<List<Dataset>>, Document> processPDF(File file,
TaggingLabel clusterLabel = cluster.getTaggingLabel();

List<LayoutToken> localTokenization = cluster.concatTokens();
if ((localTokenization == null) || (localTokenization.size() == 0))
if (CollectionUtils.isEmpty(localTokenization))
continue;

if (clusterLabel.equals(TaggingLabels.CITATION_MARKER)) {
Expand Down Expand Up @@ -1937,7 +1937,10 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
}

// Read and parse references
Map<String, Pair<String, org.w3c.dom.Node>> referenceMap = new HashMap<>();

BiblioComponentWrapper biblioComponentWrapper = new BiblioComponentWrapper();

Map<Integer, Pair<String, org.w3c.dom.Node>> referenceMap = new HashMap<>();
try {
String expression = "//*[local-name() = 'div'][@*[local-name()='type' and .='references']]/*[local-name() = 'listBibl']/*[local-name() = 'biblStruct']";
org.w3c.dom.NodeList bodyNodeList = (org.w3c.dom.NodeList) xPath.evaluate(expression,
Expand All @@ -1953,7 +1956,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
String referenceText = item.getTextContent();
String normalizedReferenceText = normalize(referenceText);
String cleanedRawReferenceText = normalizedReferenceText.replaceAll("\\p{Space}+", " ").strip().replaceAll("[ ]{2,}", ", ");
referenceMap.put(attribute.getNodeValue(), Pair.of(cleanedRawReferenceText, item));
referenceMap.put(biblioComponentWrapper.getRefKey(attribute.getNodeValue()), Pair.of(cleanedRawReferenceText, item));
}
}
}
Expand All @@ -1974,6 +1977,18 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)))
.collect(Collectors.toList());

// List<Map<String, Triple<OffsetPosition, String, String>>> referencesInSequences = selectedSequences.stream()
// .map(sequence -> sequence.getReferences().entrySet().stream()
// .filter(entry -> BIBLIO_CALLOUT_TYPE.equals(entry.getValue().getRight()))
// .collect(
// Collectors.toMap(
// entry -> String.valueOf(biblioComponentWrapper.getRefKey(entry.getValue().getMiddle())),
// Map.Entry::getValue
// )
// )
// )
// .collect(Collectors.toList());

// List<Map<String, Triple<OffsetPosition, String, String>>> referencesList = selectedSequences.stream()
// .map(DatasetDocumentSequence::getReferences)
// .filter(map -> map.values().stream()
Expand All @@ -1990,15 +2005,16 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
String target = infos.getMiddle();
OffsetPosition position = infos.getLeft();

Pair<String, org.w3c.dom.Node> referenceInformation = referenceMap.get(target);
Pair<String, org.w3c.dom.Node> referenceInformation = referenceMap.get(biblioComponentWrapper.getRefKey(target));
if (referenceInformation != null) {
BiblioItem biblioItem = XMLUtilities.parseTEIBiblioItem((org.w3c.dom.Element) referenceInformation.getRight());
String refTextClean = refText.replaceAll("[\\[\\], ]+", "");

biblioRefMap.put(refTextClean, biblioItem);
BiblioComponent biblioComponent = new BiblioComponent(
biblioItem, Integer.parseInt(target.replace("b", ""))
);

Integer refKey = biblioComponentWrapper.getRefKey(target);

BiblioComponent biblioComponent = new BiblioComponent(biblioItem, refKey);
biblioComponent.setRawForm(refText);
biblioComponent.setOffsetStart(position.start);
biblioComponent.setOffsetEnd(position.end);
Expand Down Expand Up @@ -2238,7 +2254,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
return Pair.of(entities, citationsToConsolidate);
}

private static String normalize(String text) {
public static String normalize(String text) {
String normalizedText = UnicodeUtil.normaliseText(text);
normalizedText = normalizedText.replace("\n", " ");
normalizedText = normalizedText.replace("\t", " ");
Expand Down
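The reference map built in `processTEIDocument` hinges on the namespace-agnostic XPath shown in the hunk above. A self-contained sketch of that selection step — the helper class, sample TEI snippet, and whitespace normalisation are illustrative, not the project's actual code:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReferenceMapDemo {
    // Namespace-agnostic XPath from the diff: selects <biblStruct> entries
    // under <div type="references">/<listBibl>, whatever the TEI namespace prefix.
    static final String XP =
        "//*[local-name() = 'div'][@*[local-name()='type' and .='references']]"
      + "/*[local-name() = 'listBibl']/*[local-name() = 'biblStruct']";

    public static Map<String, String> buildReferenceMap(String tei) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(tei)));
        NodeList refs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(XP, doc, XPathConstants.NODESET);
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < refs.getLength(); i++) {
            Node item = refs.item(i);
            Node id = item.getAttributes().getNamedItem("xml:id");
            if (id != null) {
                // Collapse whitespace, as the diff does before storing the raw reference text.
                map.put(id.getNodeValue(),
                        item.getTextContent().replaceAll("\\p{Space}+", " ").strip());
            }
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        String tei = "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><text><back>"
                   + "<div type=\"references\"><listBibl>"
                   + "<biblStruct xml:id=\"b0\"><note>Smith 2020</note></biblStruct>"
                   + "</listBibl></div></back></text></TEI>";
        System.out.println(buildReferenceMap(tei));
    }
}
```

In the actual diff, the string key returned by `getNamedItem` is further passed through `biblioComponentWrapper.getRefKey(...)` so the map is keyed by integer.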
61 changes: 38 additions & 23 deletions src/main/java/org/grobid/core/utilities/XMLUtilities.java
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@
import java.io.StringWriter;
import java.util.*;

import static org.grobid.core.engines.DatasetParser.normalize;

/**
* Some convenient methods for suffering a bit less with XML.
*/
Expand Down Expand Up @@ -82,7 +84,7 @@ public static String toPrettyString(String xml, int indent) {

public static Element getFirstDirectChild(Element parent, String name) {
for(Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
if (child instanceof Element && name.equals(child.getNodeName()))
if (child instanceof Element && name.equals(child.getNodeName()))
return (Element) child;
}
return null;
Expand All @@ -91,8 +93,8 @@ public static Element getFirstDirectChild(Element parent, String name) {
public static Element getLastDirectChild(Element parent, String name) {
NodeList children = parent.getChildNodes();
for(int j=children.getLength()-1; j>0; j--) {
Node child = children.item(j);
if (child instanceof Element && name.equals(child.getNodeName()))
Node child = children.item(j);
if (child instanceof Element && name.equals(child.getNodeName()))
return (Element) child;
}
return null;
Expand Down Expand Up @@ -123,7 +125,7 @@ public static BiblioItem parseTEIBiblioItem(org.w3c.dom.Element biblStructElemen
} catch(Exception e) {
if (teiXML != null)
LOGGER.warn("The parsing of the biblStruct from TEI document failed for: " + teiXML);
else
else
LOGGER.warn("The parsing of the biblStruct from TEI document failed for: " + biblStructElement.toString());
}
return handler.getBiblioItem();
Expand All @@ -138,7 +140,7 @@ public static String getTextNoRefMarkers(Element element) {
if (node.getNodeType() == Node.ELEMENT_NODE) {
if ("ref".equals(node.getNodeName()))
continue;
}
}
if (node.getNodeType() == Node.TEXT_NODE) {
buf.append(node.getNodeValue());
found = true;
Expand All @@ -147,6 +149,19 @@ public static String getTextNoRefMarkers(Element element) {
return found ? buf.toString() : null;
}

public static String getTextRecursively(Node node) {
StringBuilder textContent = new StringBuilder();
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeType() == Node.TEXT_NODE) {
textContent.append(child.getNodeValue());
} else if (child.getNodeType() == Node.ELEMENT_NODE) {
textContent.append(getTextRecursively(child));
}
}
return textContent.toString();
}
/**
* @return Pair with text or null on the left and a Triple with (position, target and type)
*/
Expand Down Expand Up @@ -181,16 +196,16 @@ public static Pair<String, Map<String,Triple<OffsetPosition, String, String>>> g
for (int j = 0; j < list2.getLength(); j++) {
Node subChildNode = list2.item(j);
if (subChildNode.getNodeType() == Node.TEXT_NODE) {
String chunk = subChildNode.getNodeValue();
String chunk = normalize(getTextRecursively(node));

if (BIBLIO_CALLOUT_TYPE.equals(((Element) node).getAttribute("type"))) {
Triple<OffsetPosition, String, String> refInfo = Triple.of(new OffsetPosition(indexPos, indexPos+chunk.length()), target, BIBLIO_CALLOUT_TYPE);
right.put(chunk, refInfo);
right.put(StringUtils.strip(chunk), refInfo);
String holder = StringUtils.repeat(" ", chunk.length());
buf.append(holder);
} else if (URI_TYPE.equals(((Element) node).getAttribute("type")) || URL_TYPE.equals(((Element) node).getAttribute("type"))) {
org.apache.commons.lang3.tuple.Triple<OffsetPosition, String, String> urlInfo = org.apache.commons.lang3.tuple.Triple.of(new OffsetPosition(indexPos, indexPos+chunk.length()), target, URL_TYPE);
right.put(chunk, urlInfo);
right.put(StringUtils.strip(chunk), urlInfo);
// we still add it like normal text
buf.append(chunk);
found = true;
Expand Down Expand Up @@ -254,8 +269,8 @@ public static String serialize(org.w3c.dom.Document doc, Node node) {
XPathFactory xpathFactory = XPathFactory.newInstance();
// XPath to find empty text nodes.
XPathExpression xpathExp = xpathFactory.newXPath().compile(
"//text()[normalize-space(.) = '']");
NodeList emptyTextNodes = (NodeList)
"//text()[normalize-space(.) = '']");
NodeList emptyTextNodes = (NodeList)
xpathExp.evaluate(doc, XPathConstants.NODESET);

// Remove each empty text node from document.
Expand Down Expand Up @@ -368,7 +383,7 @@ public static void cleanXMLCorpus(String documentPath) throws Exception {
// Return pretty print xml string
StringWriter stringWriter = new StringWriter();
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));

// write result to file
FileUtils.writeStringToFile(outputFile, stringWriter.toString(), "UTF-8");

Expand All @@ -386,7 +401,7 @@ public static void cleanXMLCorpus(String documentPath) throws Exception {

/**
* Return the document ID where the annotation is located
*/
*/
private static String getDocIdFromRs(org.w3c.dom.Node node) {
String result = null;
// first go up to the tei element root
Expand Down Expand Up @@ -423,11 +438,11 @@ private static String getDocIdFromRs(org.w3c.dom.Node node) {
}

public static String stripNonValidXMLCharacters(String in) {
StringBuffer out = new StringBuffer();
char current;
StringBuffer out = new StringBuffer();
char current;

if (in == null || ("".equals(in)))
return "";
if (in == null || ("".equals(in)))
return "";
for (int i = 0; i < in.length(); i++) {
current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
if ((current == 0x9) ||
Expand All @@ -439,7 +454,7 @@ public static String stripNonValidXMLCharacters(String in) {
out.append(current);
}
return out.toString();
}
}

private static List<String> textualElements = Arrays.asList("p", "figDesc");

Expand All @@ -451,7 +466,7 @@ public static void segment(org.w3c.dom.Document doc, Node node) {
final NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
final Node n = children.item(i);
if ( (n.getNodeType() == Node.ELEMENT_NODE) &&
if ( (n.getNodeType() == Node.ELEMENT_NODE) &&
(textualElements.contains(n.getNodeName())) ) {
// text content
//String text = n.getTextContent();
Expand Down Expand Up @@ -492,7 +507,7 @@ public static void segment(org.w3c.dom.Document doc, Node node) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document d = factory.newDocumentBuilder().parse(new InputSource(new StringReader(fullSent)));
org.w3c.dom.Document d = factory.newDocumentBuilder().parse(new InputSource(new StringReader(fullSent)));
} catch(Exception e) {
fail = true;
}
Expand All @@ -509,7 +524,7 @@ public static void segment(org.w3c.dom.Document doc, Node node) {
//System.out.println("-----------------");
sent = sent.replace("\n", " ");
sent = sent.replaceAll("( )+", " ");

//Element sentenceElement = doc.createElement("s");
//sentenceElement.setTextContent(sent);
//newNodes.add(sentenceElement);
Expand Down Expand Up @@ -539,12 +554,12 @@ public static void segment(org.w3c.dom.Document doc, Node node) {
if (n.getNodeName().equals("figDesc")) {
Element theDiv = doc.createElementNS("http://www.tei-c.org/ns/1.0", "div");
Element theP = doc.createElementNS("http://www.tei-c.org/ns/1.0", "p");
for(Node theNode : newNodes)
for(Node theNode : newNodes)
theP.appendChild(theNode);
theDiv.appendChild(theP);
n.appendChild(theDiv);
} else {
for(Node theNode : newNodes)
for(Node theNode : newNodes)
n.appendChild(theNode);
}

Expand All @@ -561,7 +576,7 @@ public static void segment(org.w3c.dom.Document doc, Node node) {
* @param args Command line arguments.
*/
public static void main(String[] args) {

// we are expecting one argument, absolute path to the TEICorpus document

if (args.length != 1) {
Expand Down