Add entry point for extracting datasets from TEI #4

Draft: wants to merge 47 commits into base: master

Commits (47):
- 5e8ecfd: make gradle build and add github actions (lfoppiano, Mar 29, 2024)
- d545b5d: read grobid-home from configuration (lfoppiano, Mar 29, 2024)
- 33648de: disable superfluous tests (lfoppiano, Mar 29, 2024)
- 49a07b6: fix build (lfoppiano, Mar 29, 2024)
- 5d2872e: add simple test on analyzer to get started (lfoppiano, Mar 29, 2024)
- 8bc2987: enable jacoco report (lfoppiano, Mar 29, 2024)
- fd84d88: fix build docker (lfoppiano, Mar 29, 2024)
- ffb5bea: disable docker build for the moment (lfoppiano, Mar 29, 2024)
- bb48f37: add parameter to enable/disable sentence segmentation for TEI processing (lfoppiano, Apr 18, 2024)
- f05f68b: Update docker build (#1) (lfoppiano, Apr 26, 2024)
- 981ac95: implement tei processing for datasets (lfoppiano, Apr 26, 2024)
- d668625: fix output JSON streaming (lfoppiano, Apr 26, 2024)
- 33d4f13: Merge branch 'master' into add-tei-processing-dataset (lfoppiano, May 1, 2024)
- 288850f: add the rest of the processing (lfoppiano, May 2, 2024)
- 12dcc37: disable broken tests (lfoppiano, May 2, 2024)
- 23c2dd5: add XML JATS entry point (lfoppiano, May 2, 2024)
- 0213c78: add CC-BY sample documents (lfoppiano, May 2, 2024)
- 52ffc23: revert to the original port (lfoppiano, May 2, 2024)
- 4448437: enable TEI processing in UI - javascript joy (lfoppiano, May 2, 2024)
- 4aad23d: correct parameter (lfoppiano, May 2, 2024)
- 6989335: attach URLs obtained from Grobid's TEI (lfoppiano, May 6, 2024)
- 7f0cdd5: fix frontend (lfoppiano, May 7, 2024)
- 1c5ff72: fix github action (lfoppiano, May 7, 2024)
- 4cd7390: fix wrong ifs - thanks intellij! (lfoppiano, May 9, 2024)
- df86b81: avoid exception when entities are empty (lfoppiano, May 9, 2024)
- 843463c: avoid injecting null stuff (lfoppiano, May 9, 2024)
- 1b1da5f: reduce the timeout for checking the disambiguation service (lfoppiano, May 12, 2024)
- 75dd711: fix the convention for sentence segmentation and enable it (lfoppiano, May 20, 2024)
- 758f418: update examples (lfoppiano, May 21, 2024)
- 91fe70d: add sequence (sentence, paragraph) identifier in each mention (lfoppiano, May 21, 2024)
- cc1cd2a: Fix sentence switch (lfoppiano, May 21, 2024)
- c58502e: Fix incorrect xpath on children (lfoppiano, May 23, 2024)
- 6977bda: Cleanup text when extracting from XML, normalise unicode character, r… (lfoppiano, Jun 4, 2024)
- cc01140: Fix bug in the xpaths that were used wrongly to select sentences or p… (lfoppiano, Jun 4, 2024)
- 3c3af44: Try to get possible sections in the <back> in which the das is hidden… (lfoppiano, Jun 4, 2024)
- 7b6fe06: update to grobid 0.8.1, and catch up other changes (lfoppiano, Sep 14, 2024)
- 2162720: retrieve URLs from the TEI XML in all the sections that are of interest (lfoppiano, Oct 13, 2024)
- a2b5bbb: update github actions (lfoppiano, Oct 13, 2024)
- e3a4890: fix xpath to fall back into div into TEI/back (lfoppiano, Oct 13, 2024)
- 371f520: cleanup (lfoppiano, Oct 13, 2024)
- 1483aab: fix reference mapping (lfoppiano, Oct 13, 2024)
- 4ab67a6: fix references extraction (lfoppiano, Oct 14, 2024)
- 774dd78: fix regression (lfoppiano, Oct 22, 2024)
- b18454b: cosmetics (lfoppiano, Oct 22, 2024)
- 962f7eb: fix regressions in the way we attach references from TEI (lfoppiano, Oct 22, 2024)
- 3b343c6: allow xml:id to be string using a wrapper that generates integer to m… (lfoppiano, Jan 1, 2025)
- f58c493: fix extraction of urls that are not well formed (supplementary-materi… (lfoppiano, Jan 2, 2025)

Changes from commit 4ab67a61b13867d1cb70aa085f6653e98f6f9446: fix references extraction
(cherry picked from commit 27194da)
lfoppiano committed Oct 14, 2024

src/main/java/org/grobid/core/data/Dataset.java (2 changes: 1 addition & 1 deletion)

@@ -279,7 +279,7 @@ public void setBibRefs(List<BiblioComponent> bibRefs) {

     public void addBibRef(BiblioComponent bibRef) {
         if (bibRefs == null) {
-            bibRefs = new ArrayList<BiblioComponent>();
+            bibRefs = new ArrayList<>();
         }
         bibRefs.add(bibRef);
     }
src/main/java/org/grobid/core/engines/DatasetParser.java (121 changes: 73 additions & 48 deletions)

@@ -1690,8 +1690,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
                     localSequence.setRelevantSectionsImplicitDatasets(true);
                     selectedSequences.add(localSequence);

-                    // Capture URLs if available
-
+                    // Capture URLs and references if available
                     Map<String, Triple<OffsetPosition, String, String>> referencesInText = XMLUtilities.getTextNoRefMarkersAndMarkerPositions((org.w3c.dom.Element) item, 0).getRight();
                     localSequence.setReferences(referencesInText);
                 }
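
Note: each selected sequence now carries the reference markers found in its own text, with offsets local to that sequence. A minimal sketch of how the captured map could be inspected (sentenceElement is a hypothetical variable, and the reading of the triple's two String members as the marker's target and raw text is an assumption, not something this diff confirms):

    Map<String, Triple<OffsetPosition, String, String>> referencesInText =
            XMLUtilities.getTextNoRefMarkersAndMarkerPositions(sentenceElement, 0).getRight();
    for (Map.Entry<String, Triple<OffsetPosition, String, String>> entry : referencesInText.entrySet()) {
        OffsetPosition where = entry.getValue().getLeft();  // position local to the sequence text
        System.out.println(entry.getKey() + " at [" + where.start + ", " + where.end + "]");
    }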
@@ -1873,7 +1872,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do


         try {
-            String expression = "//*[local-name() = 'text']/*[local-name() = 'back']/*[local-name() = 'div'][not(@type) or (" + String.join(" and ", specificSectionTypesAnnex.stream().map(type-> "not(contains(@type, '"+type+"'))").collect(Collectors.joining())) + ")]/*[local-name()='div']/*[local-name() = 'p']";
+            String expression = "//*[local-name() = 'text']/*[local-name() = 'back']/*[local-name() = 'div'][not(@type) or (" + String.join(" and ", specificSectionTypesAnnex.stream().map(type -> "not(contains(@type, '" + type + "'))").collect(Collectors.joining())) + ")]/*[local-name()='div']/*[local-name() = 'p']";
             expression = extractParagraphs ? expression : expression + "/*[local-name() = 's']";
             org.w3c.dom.NodeList annexNodeList = (org.w3c.dom.NodeList) xPath.evaluate(expression,
                     doc,
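
Note: the generated XPath is easier to read expanded. Its intended shape, assuming specificSectionTypesAnnex holds the hypothetical values "availability" and "funding", selects paragraphs under <div> elements of <back> whose @type is absent or not among the excluded types:

    // Illustration only; "availability" and "funding" are hypothetical section types.
    String intended =
            "//*[local-name() = 'text']/*[local-name() = 'back']"
            + "/*[local-name() = 'div'][not(@type) or ("
            + "not(contains(@type, 'availability')) and not(contains(@type, 'funding'))"
            + ")]/*[local-name()='div']/*[local-name() = 'p']";
    // With extractParagraphs == false, "/*[local-name() = 's']" is appended,
    // so individual <s> sentence elements are selected instead of whole paragraphs.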
@@ -1981,6 +1980,8 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
                     Pair<String, org.w3c.dom.Node> referenceInformation = referenceMap.get(target);
                     if (referenceInformation != null) {
                         BiblioItem biblioItem = XMLUtilities.parseTEIBiblioItem((org.w3c.dom.Element) referenceInformation.getRight());
+                        refText = refText.replaceAll("[\\[\\], ]+", "");
+
                         biblioRefMap.put(refText, biblioItem);
                         BiblioComponent biblioComponent = new BiblioComponent(biblioItem, Integer.parseInt(target.replace("b", "")));
                         biblioComponent.setRawForm(refText);
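
Note: this added normalization is the heart of the "fix references extraction" commit. The raw callout text is stripped of brackets, commas, and spaces before becoming the biblioRefMap key, so the numeric lookup done later when collecting citations to consolidate can match it. A minimal sketch with a hypothetical marker:

    String refText = "[12]";                                 // hypothetical raw callout text
    refText = refText.replaceAll("[\\[\\], ]+", "");         // -> "12"
    // The numeric key derived from the TEI target has the same form:
    int refKey = Integer.parseInt("b12".replace("b", ""));   // -> 12
    // so biblioRefMap.get(String.valueOf(refKey)) finds the entry stored under "12".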
@@ -1999,8 +2000,6 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do

         List<LayoutToken> allDocumentTokens = new ArrayList<>();

-        int startingOffset = 0;
-        List<Integer> sentenceOffsetStarts = new ArrayList<>();
         for (DatasetDocumentSequence sequence : selectedSequences) {
             List<LayoutToken> sentenceTokens = datastetAnalyzer.tokenizeWithLayoutToken(sequence.getText());
             sequence.setTokens(sentenceTokens);
@@ -2028,34 +2027,21 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
 //                    }
 //                });
 //
-            int finalStartingOffset = startingOffset;
-            List<LayoutToken> sentenceTokenAllTokens = sentenceTokens.stream()
-                    .map(lt -> {
-                        lt.setOffset(lt.getOffset() + finalStartingOffset);
-                        return lt;
-                    })
-                    .collect(Collectors.toList());
+//            int finalStartingOffset = startingOffset;
+//            List<LayoutToken> sentenceTokenAllTokens = sentenceTokens.stream()
+//                    .map(lt -> {
+//                        lt.setOffset(lt.getOffset() + finalStartingOffset);
+//                        return lt;
+//                    })
+//                    .collect(Collectors.toList());

-            allDocumentTokens.addAll(sentenceTokenAllTokens);
-            sentenceOffsetStarts.add(startingOffset);
-            startingOffset += sequence.getText().length();
+            allDocumentTokens.addAll(sentenceTokens);
         }

-        List<List<Dataset>> datasetLists = processing(selectedSequences, false);
+        List<List<Dataset>> datasetLists = processing(selectedSequences, disambiguate);

         entities.addAll(datasetLists);

-        for (int i = 0; i < entities.size(); i++) {
-            List<Dataset> datasets = entities.get(i);
-            if (datasets == null) {
-                continue;
-            }
-            for (Dataset dataset : datasets) {
-                if (dataset == null)
-                    continue;
-                dataset.setGlobalContextOffset(sentenceOffsetStarts.get(i));
-            }
-        }

         // TODO make sure that selectedSequences == allSentences above in the processPDF?
         List<String> allSentences = selectedSequences.stream().map(DatasetDocumentSequence::getText).toList();
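
Note: the deleted bookkeeping is deliberate, not an accidental loss. The running startingOffset made every token offset document-global, which suits the PDF path but not the TEI path, where (as the javadoc of attachRefBibSimple later in this diff puts it) all reference offsets are local to the sentence. A worked illustration with a hypothetical sentence:

    // Hypothetical sentence, offsets local to the sentence itself:
    String sentence = "We used DataABC [12] for the analysis.";
    int nameEnd = 15;    // end offset of the dataset name "DataABC"
    int refStart = 16;   // start offset of the callout "[12]"
    int refEnd = 20;     // end offset of the callout
    // Name and callout are compared directly, with no document-level shift:
    boolean attachable = refStart >= nameEnd && refEnd <= nameEnd + 5;  // see attachRefBibSimple below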
@@ -2101,7 +2087,8 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
                         termPattern,
                         placeTaken.get(i),
                         frequencies,
-                        sentenceOffsetStarts.get(i)
+                        0
+                        // sentenceOffsetStarts.get(i)
                 );
                 if (localEntities != null) {
                     Collections.sort(localEntities);
@@ -2154,7 +2141,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
         // Enhance information in dataset entities
         if (CollectionUtils.isNotEmpty(bibRefComponents)) {
             // attach references to dataset entities
-            entities = attachRefBib(entities, bibRefComponents);
+            entities = attachRefBibSimple(entities, bibRefComponents);
         }

         // consolidate the attached ref bib (we don't consolidate all bibliographical references
@@ -2168,7 +2155,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
                 for (BiblioComponent bibRef : bibRefs) {
                     Integer refKeyVal = bibRef.getRefKey();
                     if (!consolidated.contains(refKeyVal)) {
-                        BiblioItem biblioItem = biblioRefMap.get(refKeyVal);
+                        BiblioItem biblioItem = biblioRefMap.get(String.valueOf(refKeyVal));
                         BibDataSet biblioDataSet = new BibDataSet();
                         biblioDataSet.setResBib(biblioItem);
                         citationsToConsolidate.add(biblioDataSet);
                     }
                 }
             }
         }

-        try {
-            Consolidation consolidator = Consolidation.getInstance();
-            Map<Integer, BiblioItem> resConsolidation = consolidator.consolidate(citationsToConsolidate);
-            for (int j = 0; j < citationsToConsolidate.size(); j++) {
-                BiblioItem resCitation = citationsToConsolidate.get(j).getResBib();
-                BiblioItem bibo = resConsolidation.get(j);
-                if (bibo != null) {
-                    BiblioItem.correct(resCitation, bibo);
+        if (StringUtils.isNotBlank(datastetConfiguration.getGluttonHost())) {
+            try {
+                Consolidation consolidator = Consolidation.getInstance();
+                Map<Integer, BiblioItem> resConsolidation = consolidator.consolidate(citationsToConsolidate);
+                for (int j = 0; j < citationsToConsolidate.size(); j++) {
+                    BiblioItem resCitation = citationsToConsolidate.get(j).getResBib();
+                    BiblioItem bibo = resConsolidation.get(j);
+                    if (bibo != null) {
+                        BiblioItem.correct(resCitation, bibo);
+                    }
                 }
+            } catch (Exception e) {
+                throw new GrobidException(
+                        "An exception occurred while running consolidation on bibliographical references.", e);
             }
-        } catch (Exception e) {
-            throw new GrobidException(
-                    "An exception occured while running consolidation on bibliographical references.", e);
         }

         // propagate the bib. ref. to the entities corresponding to the same dataset name without bib. ref.
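
Note: the effect of the new guard is that consolidation only runs when a Glutton host is actually configured, so deployments without a consolidation service skip the block instead of hitting the exception path, and the attached references keep the metadata parsed from the TEI. A hedged sketch of the two behaviors:

    String gluttonHost = datastetConfiguration.getGluttonHost();
    if (StringUtils.isBlank(gluttonHost)) {
        // No consolidation service configured: citationsToConsolidate keeps the
        // BiblioItems parsed from the TEI <back> as-is.
    } else {
        // A consolidation service (e.g. a biblio-glutton instance) is configured:
        // consolidate() may enrich or correct each attached reference.
    }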
@@ -2230,8 +2219,7 @@ public Pair<List<List<Dataset>>, List<BibDataSet>> processTEIDocument(org.w3c.do
         entities = DatasetContextClassifier.getInstance(datastetConfiguration)
                 .classifyDocumentContexts(entities);

-        List<BibDataSet> resCitations = List.of();
-        return Pair.of(entities, resCitations);
+        return Pair.of(entities, citationsToConsolidate);
     }

     private static String normalize(String text) {
@@ -2355,10 +2343,11 @@ public static boolean checkDASAnnex(List<LayoutToken> annexTokens) {
         return false;
     }

     /**
      * Try to attach relevant bib ref component to dataset entities
      */
     public List<List<Dataset>> attachRefBib(List<List<Dataset>> entities, List<BiblioComponent> refBibComponents) {
+        return attachRefBib(entities, refBibComponents, 5);
+    }
+
+    public List<List<Dataset>> attachRefBib(List<List<Dataset>> entities, List<BiblioComponent> refBibComponents, int distance) {

         // we anchor the process to the dataset names and aggregate other closest components on the right
         // if we cross a bib ref component we attach it, if a bib ref component is just after the last
@@ -2387,7 +2376,7 @@ public List<List<Dataset>> attachRefBib(List<List<Dataset>> entities, List<Bibli
             for (BiblioComponent refBib : refBibComponents) {
                 //System.out.println(refBib.getOffsetStart() + " - " + refBib.getOffsetStart());
                 if ((refBib.getOffsetStart() >= pos) &&
-                        (refBib.getOffsetStart() <= endPos + 5)) {
+                        (refBib.getOffsetStart() <= endPos + distance)) {
                     entity.addBibRef(refBib);
                     endPos = refBib.getOffsetEnd();
                 }
@@ -2398,6 +2387,42 @@ public List<List<Dataset>> attachRefBib(List<List<Dataset>> entities, List<Bibli

         return entities;
     }

+    /**
+     * Try to attach relevant bib ref component to dataset entities, this does not use the global offset as in the
+     * TEI all references' offsets are local to the sentence
+     */
+    public List<List<Dataset>> attachRefBibSimple(List<List<Dataset>> entities, List<BiblioComponent> refBibComponents) {
+        return attachRefBib(entities, refBibComponents, 5);
+    }
+
+    public List<List<Dataset>> attachRefBibSimple(List<List<Dataset>> entities, List<BiblioComponent> refBibComponents, int distance) {
+
+        // we anchor the process to the dataset names and aggregate other closest components on the right
+        // if we cross a bib ref component we attach it, if a bib ref component is just after the last
+        // component of the entity group, we attach it
+        for (List<Dataset> datasets : entities) {
+            for (Dataset entity : datasets) {
+                if (entity.getDatasetName() == null)
+                    continue;
+
+                // find the name component and the offset
+                DatasetComponent nameComponent = entity.getDatasetName();
+                int pos = nameComponent.getOffsetEnd();
+
+                // find included or just next bib ref callout
+                List<BiblioComponent> relatedReferences = refBibComponents.stream()
+                        .filter(ref -> ref.getOffsetStart() >= pos && ref.getOffsetEnd() <= pos + distance)
+                        .collect(Collectors.toList());
+
+                if (CollectionUtils.isNotEmpty(relatedReferences)) {
+                    entity.setBibRefs(relatedReferences);
+                }
+            }
+        }
+
+        return entities;
+    }

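Note: a small usage sketch of the new attachment logic (parser, entities, and bibRefComponents are hypothetical variables; offsets follow the sentence-local convention):

    // For "We used DataABC [12] ..." with the name ending at 15 and "[12]" spanning 16-20:
    // 16 >= 15 and 20 <= 15 + 5, so the callout lands inside the window and is attached.
    List<List<Dataset>> withRefs = parser.attachRefBibSimple(entities, bibRefComponents, 5);
    // Same logic with a wider, explicit window:
    List<List<Dataset>> withRefsWide = parser.attachRefBibSimple(entities, bibRefComponents, 10);
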
     public List<List<OffsetPosition>> preparePlaceTaken(List<List<Dataset>> entities) {
         List<List<OffsetPosition>> localPositions = new ArrayList<>();
         for (List<Dataset> datasets : entities) {

@@ -2690,7 +2715,7 @@ public List<Dataset> propagateLayoutTokenSequence(DatasetDocumentSequen
                     entity.getSequenceIdentifiers().addAll(name.getSequenceIdentifiers());
                     //entity.setType(DatastetLexicon.Dataset_Type.DATASET);
                     entity.setPropagated(true);
-                    entity.setGlobalContextOffset(sentenceOffsetStart);
+                    // entity.setGlobalContextOffset(sentenceOffsetStart);
                     if (entities == null)
                         entities = new ArrayList<>();
                     entities.add(entity);
@@ -4,6 +4,7 @@
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import org.apache.commons.collections4.CollectionUtils;
 import org.apache.commons.lang3.StringUtils;
 import org.apache.commons.lang3.tuple.Pair;
 import org.grobid.core.data.BibDataSet;

@@ -338,7 +339,7 @@ public static Response processDatasetJATS(final InputStream inputStream,
             json.append(", \"md5\": \"" + md5Str + "\"");
             json.append(", \"mentions\":[");

-            if (extractedEntities != null && extractedEntities.size()>0) {
+            if (CollectionUtils.isNotEmpty(extractedEntities)) {
                 boolean startList = true;
                 for(List<Dataset> results : extractedEntities) {
                     for(Dataset dataset : results) {
@@ -353,12 +354,12 @@ public static Response processDatasetJATS(final InputStream inputStream,

             json.append("], \"references\":[");

-//            if (extractionResult != null) {
-//                List<BibDataSet> bibDataSet = extractionResult.getRight();
-//                if (bibDataSet != null && bibDataSet.size()>0) {
-//                    DatastetServiceUtils.serializeReferences(json, bibDataSet, extractedEntities);
-//                }
-//            }
+            if (CollectionUtils.isNotEmpty(extractedEntities)) {
+                List<BibDataSet> bibDataSet = extractionResult.getRight();
+                if (CollectionUtils.isNotEmpty(bibDataSet)) {
+                    DatastetServiceUtils.serializeReferences(json, bibDataSet, extractedEntities);
+                }
+            }

             json.append("]");
@@ -442,7 +443,7 @@ public static Response processDatasetTEI(final InputStream inputStream,
             String md5Str = DatatypeConverter.printHexBinary(digest).toUpperCase();
             json.append(", \"md5\": \"" + md5Str + "\"");
             json.append(", \"mentions\":[");
-            if (extractedEntities != null && extractedEntities.size()>0) {
+            if (CollectionUtils.isNotEmpty(extractedEntities)) {
                 boolean startList = true;
                 for(List<Dataset> results : extractedEntities) {
                     for(Dataset dataset : results) {
@@ -454,14 +455,15 @@ public static Response processDatasetTEI(final InputStream inputStream,
                     }
                 }
             }
-            json.append("], \"references\":[]");

-//            if (extractionResult != null) {
-//                List<BibDataSet> bibDataSet = extractionResult.getRight();
-//                if (bibDataSet != null && bibDataSet.size()>0) {
-//                    DatastetServiceUtils.serializeReferences(json, bibDataSet, extractedEntities);
-//                }
-//            }
+            json.append("], \"references\":[");
+
+            if (CollectionUtils.isNotEmpty(extractedEntities)) {
+                List<BibDataSet> bibDataSet = extractionResult.getRight();
+                if (CollectionUtils.isNotEmpty(bibDataSet)) {
+                    DatastetServiceUtils.serializeReferences(json, bibDataSet, extractedEntities);
+                }
+            }
+            json.append("]");

             float runtime = ((float)(end-start)/1000);
             json.append(", \"runtime\": "+ runtime);